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ABSTRACT 

Current computer architecture has moved towards the multi/many-core structure. 
However, the algorithms in the current sequential dense numerical linear algebra li- 
braries (e.g. LAPACK) do not parallelize well on multi/many-core architectures. A 
new family of algorithms, the tile algorithms, has recently been introduced to circum- 
vent this problem. Previous research has shown that it is possible to write efficient 
and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU 
factorization, and a QR factorization. The goal of this thesis is to study tiled al- 
gorithms in a multi/many-core setting and to provide new algorithms that exploit 
the current architecture to improve performance relative to current state-of-the-art 
libraries while maintaining the stability and robustness of these libraries. 

In Chapter [2j we confront the problem of computing the inverse of a symmetric 
positive definite matrix with tiled algorithms. We observe that, using a dynamic 
task scheduler, it is relatively painless to translate existing LAPACK code to ob- 
tain a ready-to-be-executed tile algorithm. However we demonstrate that nontrivial 
compiler techniques (array renaming, loop reversal and pipelining) need to be ap- 
plied to further increase the parallelism of our application, both theoretically and 
experimentally. 

Chapter [3] revisits existing algorithms for the QR factorization of rectangular 
matrices composed of p x q tiles, where p > q, for an unlimited number of processors. 
Within this framework, we study the critical paths and performance of algorithms 
such as Sameh-Kuck, Fibonacci, Greedy, and those found within PLASMA. We 
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also provide a monotonically increasing function to transform the elimination list of a 
coarse-grain algorithm to a tiled algorithm. Although the optimality from the coarse- 
grain Greedy algorithm does not translate to the tiled algorithms, we propose a new 
algorithm and show that it is optimal in the tiled context. 

In Chapters [2] and [3j our context includes an unbounded number of processors. 
The exercise was to find algorithmic variants with short critical paths. Since the 
number of resources is unlimited, any task is executed as soon as all its dependencies 
are satisfied. In Chapters [4] and |5j we set ourselves in the more realistic context 
of bounded number of processors. In this context, at a given time, the number 
of ready-to-go tasks can exceed the number of available resources, and therefore a 
schedule which prescribes which tasks to execute when needs to be defined. For 
the Cholesky factorization, we study standard schedules and find that the critical 
path schedule is best. We also derive a lower bound on the time to solution of the 
optimal schedule. We conclude that the critical path schedule is nearly optimal for 
our study. For the QR factorization problem, we study the problem of optimizing 
the reduction trees (therefore the algorithm) and the schedule simultaneously. This 
is considerably harder than the Cholesky factorization where the algorithm is fixed 
and so, for Cholesky factorization, we are concerned only with schedules. We provide 
a lower bound for the time to solution for any tiled QR algorithm and any schedule. 
We also show that, in some cases, the optimal algorithm for an unbounded number of 
processors (found in Chapter [3]) cannot be scheduled to solve optimally the combined 
problem. We compare our algorithms and schedules with our lower bound. 

Finally, in Chapter [6] we study a recursive tiled algorithm in the context of matrix 
multiplication using the Winograd-Strassen algorithm using a dynamic task sched- 
uler. Whereas most implementations obtain either one or two levels of recursion, our 
implementation supports any level of recursion. 
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1. Introduction 

The High-Performance Computing (HPC) landscape is trending more towards a 
multi/many-core architecture [S] as is evidenced by the recent projects of major chip 
manufacturers and reports of surveys conducted by consulting companies [52] • The 
computational algorithms for dense linear algebra need to be re-examined to make 
better use of these architectures and provide higher levels of efficiency Some of these 
algorithms may have a straight forward translation from the current state-of-the-art 
libraries while others require much more thought and effort to gain performance in- 
creases. In this thesis, we endeavor to achieve algorithms that exploit the architecture 
to improve performance relative to current state-of-the-art computational libraries 
while maintaining the stability and robustness of these libraries. 

Our research will make use of the BLAS (Basic Linear Algebra Subprograms) [15] 
and the LAPACK (Linear Algebra PACKage) [7] libraries. The BLAS are a stan- 
dard to perform basic linear algebra operations involving vectors and matrices while 
LAPACK performs the more complicated and higher level linear algebra operations. 

The BLAS divide numerical linear algebra operations into three distinct group- 
ings based upon the amount of input data and the computational cost. Operations 
involving only vectors are considered Level 1; those involving both vectors and ma- 
trices are Level 2; and those involving only matrices are Level 3. For matrices of size 
n x n, the Level 3 operations are most coveted since they use 0(n 2 ) data for 0(n 3 ) 
computations and inherently reduce the amount of memory traffic. Since the BLAS 
provide fundamental linear algebra operations, hardware and software vendors such 
as Intel, AMD, and IBM provide optimized BLAS libraries for a variety of archi- 
tectures. The BLAS library can be multithreaded to make use of multi/many-core 
architectures and most of the vendor supplied libraries are multithreaded. 

The LAPACK library is a collection of subroutines for solving most of the common 
problems in numerical linear algebra and was developed to make use of Level 3 BLAS 
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operations as much as possible. Algorithms within LAPACK are written to make use 
of panels which can be either a block of columns or a block of rows so that updates 
are performed using matrix multiplications instead of vector operations. LAPACK 
can make use of a multithreaded BLAS to exploit multi/many-core architectures but 
this may not be enough to fully exploit the capability of the architecture. 

As an example, let us consider the Cholesky decomposition to f actor ize a symmet- 
ric positive definite (SPD) matrix into its triangular factor. There are three variants 
for performing the Cholesky decomposition: bordered, left-looking, and right-looking. 
All three work on either the upper or lower triangular portion of the matrix and 
produce the same triangular factor, but depending on the usage, one may have an 



advantage over the others. In Figure 1.1 we depict the steps involved in each variant 
using the lower triangular formulations. At the start of each variant, the matrix is 
subdivided into blocks of rows and columns, or panels, which will take advantage of 
the Level 3 BLAS. 



The 'bordered' variant, as depicted in Figure 1.1a involves a loop over three 
steps. The first step updates the purple row block using the already factorized green 
portion, the second step updates the next triangular block to be factorized (in red), 
and the third step performs the factorization of the triangular block. This is then 
repeated until the entire matrix is factorized. Note that the lower portion of the 
matrix is not touched by the preceding steps. 



The 'left-looking' variant (see Figure 1.1b) involves four steps. The first step 



updates the triangular block in red which is then factorized in the second step, the 
third step updates the block column (in cyan) below the triangular block using the 
previous columns, and the last step updates the column block using the factorization 
of step 2. It is called left-looking since the algorithm does not affect the portion to 
the right of the current block of the matrix and only 'looks' to the left for its updates. 
The top most triangular portion of the matrix is in its final form and will not change 



IK 



STEP 1 :: 



STEP 2 : 




a 



STEP 3 :: ^ < i CHOL( 

(a) Bordered variant 




STEP 1 : 

STEP 2 : 
STEP 3 : 

STEP 4 : 



^ CHOL( ^) 



□ 



(b) Left-looking variant 



D 









IK 



STEP 1 : 
STEP 2 : 

STEP 3 : 



kK 



CHOL( ^) 



(c) Right-looking variant 
Figure 1.1: Three variants for the Cholesky decomposition 
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in the suceeding steps of the algorithm. 



The 'right-looking' variant (see Figure 1.1c) involves three steps. It performs the 



factorization of the red triangular block, updates the block column (in cyna) and then 
updates the blue trailing matrix on the right. This is called right- looking since it does 
not require anything from the previous factorized matrix and pushes its updates to 
the right part of the matrix. The entire matrix to the left of the column block in 
which the algorithm is currently working is in its final form. 

The advantage of the bordered variant is that it does the least amount of oper- 
ations to determine if a matrix is SPD. The advantage of the right-looking variant 
is that it provides the most parallelism. A major disadvantage of the left-looking 
variant is the added fork-join that it must perform between the steps as compared to 
the other two variants which will negatively affect its parallel performance. 

The LAPACK scheme of using panels has three distinct disadvantages which 
limit its performance. The first can be seen in the third step of the right-looking 



Cholesky decomposition (Figure 1.1c) where potentially a large symmetric rank k 
operation is performed; the memory architecture will bound the performance of the 
algorithm. Secondly, there is some impact of the synchronizations that must be 
performed between each step. Third, the idea of panels does not allow for fine- 
grained tasks. We alleviate the last two of these restrictions through the use of tiled 
algorithms whereas the first is overcome through the use of Level 3 BLAS operations 
within the tiled algorithm. 

We approach this via tiling a matrix which means reordering the data of the 
matrix into smaller regions of contiguous memory as depicted in Figure 1.2| By 



varying the tile size, this data layout allows us to tune the algorithm such that the 
data needed for the kernels is present in the cache of the processor core. Moreover, 
we are able to increase the amount of parallelism and minimize the synchronizations. 
Let us revisit the Cholesky decomposition as described earlier and apply each 



Figure 1.2: Data layout of tiled matrix 



step to the tiled matrix. In Figure 1.3 we present the directed acyclic graphs (DAG) 
for the three variants on a tiled matrix of 4 x 4 tiles. In each of the DAGs, the tasks 
are represented as the nodes and the data dependencies are the edges. The dashed 
horizontal lines designate a full sweep through all of the steps in each algorithm. 




The first observation that one makes is that the height of each DAG varies ac- 
cording to which variant is chosen. The bordered variant is the tallest since the tasks 
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become almost sequential where the only portion that is not sequential is that of 
the row block update. The left-looking variant is almost of height t 2 where t is the 
number of tiles in a column of the tiled matrix. It gains parallelism from being able 
to update the column block of the final step within the loop in parallel. As before, 
the right-looking variant is the shortest and provides the most parallelism. 

However, in the tiled versions, the synchronizations between each step of the 
LAPACK algorithms is superficial and can be removed. By doing so, these three 
variants reduce to only the right-looking variant. 

The main difference between the tiled version and the blocked version is the 
amount of parallelism that is gained from updates of the trailing matrix. Instead of 
performing a large symmetric rank k update where k is the number of rows in a row 
block or the number of columns in a column block, the operation is decomposed into 
smaller symmetric rank n b updates and associated matrix multiplications where rib 
is the size of a tile such that N = t ■ rib- In the right-looking variant, for an N x N 
matrix the size of the first trailing matrix is (N — k) x (N — k) so that the update 
operation for this first matrix has a computational cost of 0(kN 2 ). The tiled update 
consists of both symmetric updates and matrix multiplications of tiles of size rib x nb 
so that the computational cost per task is 0{n\). 

The size of the tiles will determine the granularity of the tasks for a tiled algo- 
rithm. For a matrix of size N x N, the tile size of the tiled txt matrix can vary from 
the entire size of the matrix (t = 1) down to a scalar (t = N), but is held constant 
throughout the execution of the algorithm. However, a balance must be kept between 
the efficiency of the kernel and the amount of data movement. 

Therefore, a tiled algorithm does overcome the restriction on the granularity 
imposed by the panels concept of LAPACK as well as alleviate some of the synchro- 
nizations and associated overhead. The memory bound is still present due to needing 
to move the updates of the trailing matrices. 
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The Parallel Linear Algebra Software for Multicore Architectures (PLASMA) [2] 
library provides the framework for the tiled algorithms. For the experimental por- 
tions, our main assumption will be a shared memory machine architecture wherein 
each processor has direct access to all portions of the memory in which the matrices 
are stored. 

In Chapter [2] we describe more fully the implementation of the tiled Cholesky 
decomposition and further the tiled Cholesky Inversion algorithm. It shows that 
translating from LAPACK to PLASMA can be straight forward, but that there are 
caveats to be taken into account. In Chapter [3] we have implemented a tiled QR 
decomposition showing that the implementation is not straight-forward. Moreover, 
results from previous work do not translate directly to the tiled algorithm. 

Unlike Chapters [2] and [3] where an unbounded number of processors is assumed, 
Chapters [4] and [5] restrict the number of processors and provide bounds on the per- 
formance of the algorithms. We observe the theoretical speed-up and efficiency and 
provide more realistic bounds on the performance. 

Finally, Chapter [6] presents a study on a tiled implementation of the Strassen- 
Winograd algorithm for matrix-matrix multiplication. Unlike the other algorithms 
presented, it is a recursive algorithm which becomes interesting in the scope of tiled 
matrices. 
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2. Cholesky Inversion 

In this chapter, we present joint work with Emmanuel Agullo, Jack Dongarra, 
Jakub Kurzak, Julien Langou, and Lee Rosenberg [I]. 

The appropriate direct method to compute the solution of a symmetric positive 
definite system of linear equations consists of computing the Cholesky factorization 
of that matrix and then solving the underlying triangular systems. It is not recom- 
mended to use the inverse of the matrix in this case. However some applications need 
to explicitly form the inverse of the matrix. A canonical example is the computation 
of the variance-covariance matrix in statistics. Higham [321 p. 260, §3] lists more such 
applications. 

With their advent, multicore architectures [50J induce the need for algorithms and 
libraries that fully exploit their capacities. A class of such algorithms - called tile 
algorithms [TBI US] - has been developed for one-sided dense factorizations (Cholesky, 
LU and QR) and made available as part of the Parallel Linear Algebra Software for 
Multicore Architectures (PLASMA) library [2J. In this chapter, we extend this class 
of algorithms to the case of the (symmetric positive definite) matrix inversion. Besides 
constituting an important functionality for a library such as PLASMA, the study of 
the matrix inversion on multicore architectures represents a challenging algorithmic 
problem. Indeed, first, contrary to standalone one-sided factorizations that have been 
studied so far, the matrix inversion exhibits many anti-dependences [6] (Write after 
Read). This is a false or artificial dependency which is reliant on the name of the 
data and not the actual data flow. For example, given two operations where the 
first only reads the data in the matrix A and the second only writes to the location 
of A, then in a parallel execution there may be a case where the data being read 
by the first operation is wrong since the second may have already written to the 
location. By copying the data from A beforehand, both operations can be executed 
in parallel. Those anti-dependences can be a bottleneck for parallel processing, which 
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is critical on multicore architectures. It is thus essential to investigate (and adapt) 
well known techniques used in compilation such as using temporary copies of data to 
remove anti-dependences to enhance the degree of parallelism of the matrix inversion. 
This technique is known as array renaming [6] (or array privatization [28]). Second, 
loop reversal [6] is to be investigated. Third, the matrix inversion consists of three 
successive steps (first of which is the Cholesky decomposition) . In terms of scheduling, 
it thus represents an opportunity to study the effects of pipelining [6] those steps on 
performance. 

The current version of PLASMA (version 2.1) is scheduled statically. Initially de- 
veloped for the IBM Cell processor |34j, this static scheduling relies on POSIX threads 
and simple synchronization mechanisms. It has been designed to maximize data reuse 
and load balancing between cores, allowing for very high performance [3] on today's 
multicore architectures. However, in the case of matrix inversion, the design of an 
ad-hoc static scheduling is a time consuming task and raises load balancing issues 
that are much more difficult to address than for a stand-alone Cholesky decomposi- 
tion, in particular when dealing with the pipelining of multiple steps. Furthermore, 
the growth of the number of cores and the more complex memory hierarchies make 
executions less deterministic. In this chapter, we rely on an experimental in-house 
dynamic scheduler [33] . This scheduler is based on the idea of expressing an algorithm 
through its sequential representation and unfolding it at runtime using data hazards 
(Read after Write, Write after Read, Write after Write) as constraints for parallel 
scheduling. The concept is rather old and has been validated by a few successful 
projects. We could have as well used schedulers from the Jade project from Stanford 
University [12] or from the SMPSs project from the Barcelona Supercomputer Center 

Our discussions are illustrated with experiments conducted on a dual-socket quad- 
core machine based on an Intel Xeon EMT64 processor operating at 2.26 GHz. The 
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theoretical peak is equal to 9.0 Gflop/s per core or 72.3 Gflop/s for the whole machine, 
composed of 8 cores. The machine is running Mac OS X 10.6.2 and is shipped with 
the Apple vecLib vl26.0 multithreaded BLAS [15] and LAPACK vendor library, as 
well as LAPACK [7J v3.2.1 and ScaLAPACK [13] vl.8.0 references. 
2.1 Tile in-place matrix inversion 

Tile algorithms are a class of Linear Algebra algorithms that allow for fine gran- 
ularity parallelism and asynchronous scheduling, enabling high performance on mul- 
ticore architectures [31 [181 EE SI]- The matrix of order n is split into t x t square 
submatrices of order b (n = b x i). Such a submatrix is of small granularity (we 
fixed b = 200 in this chapter) and is called a tile. So far, tile algorithms have been 
essentially used to implement one-sided factorizations 0, CIEJ EES SI] • 



Algorithm |2.1| extends this class of algorithms to the case of the matrix inver- 
sion. As in state-of-the-art libraries (LAPACK, ScaLAPACK), the matrix inversion 
is performed in-place, i.e., the data structure initially containing matrix A is directly 
updated as the algorithm is progressing, without using any significant temporary 



extra- storage; eventually, A' 1 replaces A. Algorithm 2.1 is composed of three steps. 
Step 1 is a Tile Cholesky Factorization computing the Cholesky factor L (lower tri- 
angular matrix satisfying A = LL T ). This step was studied in [15]. Step 2 computes 
L^ 1 by inverting L. Step 3 finally computes the inverse matrix A^ 1 = L~ lT L~ l . 
Each step is composed of multiple fine granularity tasks (since operating on tiles). 
These tasks are part of the BLAS (SYRK, GEMM, TRSM, TRMM) and LAPACK 
(POTRF, TRTRI, LAUUM) standards. A more detailed description is beyond the 
scope of this extended chapter and is not essential to the understanding of the rest 
of the chapter. Indeed, from a high level point of view, an operation based on tile 
algorithms can be represented as a Directed Acyclic Graphs (DAG) [22] where nodes 
represent the fine granularity tasks in which the operation can be decomposed and the 



edges represent the dependences among them. For instance, Figure 2.1a represents 
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Algorithm 2.1: Tile In-place Cholesky Inversion (lower format). Matrix A is 
the on-going updated matrix (in-place algorithm). 

Input: A, Symmetric Positive Definite matrix in tile storage (t x t tiles). 

Result: A^ 1 , stored in-place in A. 

1 Step 1: Tile Cholesky Factorization (compute L such that A = LL T ); 

2 for j ; = to t — 1 do 



3 
4 

5 
6 
7 

8 



9 
10 



for k — to j — 1 do 

L A jj <" - A j, k * A lk (SYRK(j,k)) ; 

A jd <- CHOL(Ajj) (POTRF(j)) ; 
for i — j + 1 to t — 1 do 
for k = to j > — 1 do 

L <~ - Ak * Aj tk (GEMM(i,j,k)) ; 



for i — j + 1 to t — 1 do 

[ A, <- Aij/Ajj (TRSM(ij)) ; 

n Step 2: Tile Triangular Inversion of L (compute L -1 ); 

12 for j ; — t — 1 to do 

is A jd <- TRINV(Ajj) (TRTRI(j)) ; 

14 for i — t — 1 to j + 1 do 

is Aij <- A M * Ajj (TRMM(iJ)) ; 

16 for fc = j + 1 to % — 1 do 

ir [ Aij <- Aj + A hk * A k j (GEMM(i,j,k)) ; 

is [_ A^ <- -Aj * Ai,i (TRMM(iJ)) ; 

19 Step 3: Tile Product of Lower Triangular Matrices (compute A^ 1 = L~ lT L~ x )\ 

20 for % = to t — 1 do 



21 

22 

23 
24 
25 
26 

27 
28 



for j 1 = to % — 1 do 

L A^ ^A T ^ A^ (TRMM(iJ)) ; 

Ai,i <- Af. * Ai,i (LAUUM(i)) ; 
for j 1 = to % — 1 do 

for = « + 1 to t — ldo 

A^ <- A^ + A T ki * A k j (GEMM(i,j,k)) ; 



for k = i + ltot — ldo 

L A i,i <- A hl + AT. * A k)i (SYRK(i,k)) ; 
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the DAG of Step 3 of Algorithm 



2.1 




(a) In-place (Algorithm 2.1 1 




(b) Out-of-place (variant introduced in § 2.2 ) 



Figure 2.1: DAGs of Step 3 of the Tile Cholesky Inversion (t — 4). 



Algorithm 2.1| is based on the variants used in LAPACK 3.2.1. Bientinesi, Gunter 
and van de Geijn pU] discuss the merits of algorithmic variations in the case of the 
computation of the inverse of a symmetric positive definite matrix. Although of 
definite interest, this is not the focus of this extended chapter. 



We have implemented Algorithm 2.1 using our dynamic scheduler introduced 



in the beginning of the chapter. Figure 2.2 shows its performance against state-of- 
the-art libraries and the vendor library on the machine described in the beginning 
of the chapter. For a matrix of small size, it is difficult to extract parallelism and 
have a full use of all the cores [31 HH [191 SI]- We indeed observe a limited scalabil- 



ity (N = 1000, Figure 2.2a). However, tile algorithms (Algorithm 2.1) still benefit 



from a higher degree of parallelism than blocked algorithms [31 HH EH SI]- There- 



fore Algorithm 2. 1| (in place) consistently achieves a significantly better performance 



than vecLib, ScaLAPACK and LAPACK libraries. A larger matrix size (N = 4000, 



Figure 2.2b) allows for a better use of parallelism. In this case, an optimized imple- 



mentation of a blocked algorithm (vecLib) competes well against tile algorithms (in 



place) on few cores (left part of Figure 2.2a). However, only tile algorithms scale to 



a larger number of cores (rightmost part of Figure 2.2b) thanks to a higher degree 

of parallelism. In other words, the tile Cholesky inversion achieves a better strong 
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Table 2.1: Length of the critical path as a function of the number of tiles t. 



Step 1 
Step 2 
Step 3 



In-place case Out-of-place case 



3t 
3t 
3t 



2 
3 
2 



3t - 2 
2t- 1 
t 



scalability than the blocked versions, similarly to what had been observed for the 
factorization step [3j [T8| [T9| [ITj . 



Cholesky Inversion (POTRF+POTRI) GUSTv2009.12.01 (run on (#thread+1) threads) 
N=1000, NB=200, VecLib, Processor 2 x 2.26GHz Quad-Core Intel Xeon 



Cholesky Inversion (POTRF+POTRI) GUSTv2009.12.01 (run on (#thread+1) threads) 
N=4000, NB=200, VecLib, Processor 2 x 2.26GHz Quad-Core Intel Xeon 



■ theoretical peak 
- perfect scalability 

■ out of place 
in place 

■ vecLib 

■ scalapack 
- lapack 





(a) n = 1000 
Figure 2.2: Scalability of Algorithm 



(b) n = 4000 

(in place) and its out-of-place variant in- 
troduced in § |2.2[ using our dynamic scheduler against vecLib, ScaLAPACK and 



2.1 



LAPACK libraries. 



2.2 Algorithmic study 



In the § 2.1, we compared the performance of the tile Cholesky inversion against 
state-the-art libraries. In this section, we focus on tile Cholesky inversion and we 



discuss the impact of several variants of Algorithm |2. 1| on performance. 

Array renaming (removing anti-dependences). The dependence between 



SYRK(0,1) and TRMM(1,0) in the DAG of Step 3 of Algorithm 2.1 (Figure 2.1a) 



represents the constraint that the SYRK operation (1. 28 of Algorithm 2.1) needs 
to read A^i = Ai.o before TRMM (1. 22) can overwrite Aij = A\<q. This anti- 
dependence (Write after Read) can be removed thanks to a temporary copy of Ax )0 . 
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Similarly, all the SYRK-TRMM anti-dependences, as well as TRMM-LAUMM and 
GEMM-TRMM anti-dependences can be removed. We have designed a variant of 



Algorithm 2.1 that removes all the anti-dependences thanks to the use of a large 
working array (this technique is called array renaming [6] in compilation [6]). The 



subsequent DAG (Figure 2.1b) is split in multiple pieces (Figure 2.1b), leading to a 



shorter critical path (Table 2.1). We implemented the out-of-place algorithm, based 



on our dynamic scheduler too. Figure 2.2a shows that our dynamic scheduler exploits 
its higher degree of parallelism to achieve a much higher strong scalability even on 



small matrices (N = 1000). For a larger matrix (Figure 2.2b), the in-place algorithm 
already achieved very good scalability. Therefore, using up to 7 cores, their perfor- 
mance are similar. However, there is not enough parallelism with a 4000 x 4000 matrix 
to use efficiently all 8 cores with the in-place algorithm; thus the higher performance 



of the out-of-place version in this case (leftmost part of Figure 2.2b). 



Loop reversal (exploiting commutativity). The most internal loop of each 



step of Algorithm 2.1 (1. 8, 1. 17 and 1. 26) consists in successive commutative GEMM 



operations. Therefore they can be performed in any order, among which increasing 
order and decreasing order of the loop index. Their ordering impacts the length of the 



critical path. Algorithm 2.1 orders those three loops in increasing (U) and decreasing 



(D) order, respectively. We had manually chosen these respective orders (UDU) 



because they minimize the critical path of each step (values reported in Table 2.1). 



A naive approach would have, for example, been comprised of consistently ordering 
the loops in increasing order (UUU). In this case (UUU), the critical path of TRTRI 
would have been equal to t 2 — 2t + 3 (in-place) or (|t 2 — |t + 2) (out-of-place) instead 



of 3t — 3 (in-place) or 2t — 1 (out-of-place) for (UDU). Figure 2.3 shows how loop 
reversal impacts performance. 

Pipelining. Pipelining the multiple steps of the inversion reduces the length of 
its critical path. For the in-place case, the critical path is reduced from 9t — 7 tasks 
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Cholesky Inversion (POTRF+POTRI) GUSTV2009. 12.01 (run on (#thread+1) threads) Cholesky Inversion (POTRF+POTRI) GUSTv2009. 12.01 (run on (#thread+1) threads) 



N=1 000, NB=200, VecLib, Processor 2 x 2.26GHz Quad-Core Intel Xeon 



- - - peak 

perfect scalability t 
outotplace(UDU) * 

- - - out of place (UUU) 

in place (UDU) 

- - - in place (UUU) 



N=4000, NB=200, VecLib, Processor 2 x 2.26GHz Quad-Core Intel Xeon 
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~ ~ ~ peak 


* jT 
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- - - out of place (UUU) 




- — in place (UDU) 




_ _ _ :,- rti^^rt i\ II II l\ 

in place (UUU) 


/ 
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* JF 

* s 

■ * . . 

* / / 


y 







3 4 5 

(fflhreads) 



3 4 5 6 

(fflhreads) 



(a) n = 1000 (b) n = 4000 

Figure 2.3: Impact of loop reversal on performance. 

(t is the number of tiles) to 9t — 9 tasks (negligible). For the out-of-place case, it 
is reduced from 6t — 3 to 5t — 2 tasks. We studied the effect of pipelining on the 
performance of the inversion on a 8000 x 8000 matrix with an artificially large tile 
size (b = 2000 and t = 4). As expected, we observed almost no effect on performance 
of the in-place case (about 36.4 seconds with or without pipelining). For the out-of- 
place case, the elapsed time grows from 25.1 to 29.2 seconds (16% overhead) when 
pipelining is prevented. 

2.3 Conclusion and future work 

We have proposed a new algorithm to compute the inverse of a symmetric positive 

definite matrix on multicore architectures. An experimental study has shown both 
an excellent scalability of our algorithm and a significant performance improvement 
compared to state-of-the-art libraries. Beyond extending the class of so-called tile 
algorithms, this study brought back to the fore well known issues in the domain of 
compilation. Indeed, we have shown the importance of loop reversal, array renaming 
and pipelining. 

The use of a dynamic scheduler allowed an out-of-the-box pipeline of the differ- 
ent steps whereas loop reversal and array renaming required a manual change to the 
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algorithm. The future work directions consist in enabling the scheduler to perform 
itself loop reversal and array renaming. We exploited the commutativity of GEMM 
operations to perform array renaming. Their associativity would furthermore allow 
to process them in parallel (following a binary tree); the subsequent impact on perfor- 
mance is to be studied. Array renaming requires extra-memory. It will be interesting 
to address the problem of the maximization of performance under memory constraint. 
This work aims to be incorporated into PLASMA. 
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3. QR Factorization 

In this chapter we present joint work with Mathias Jacquelin, Julien Langou, and 
Yves Robert [3T] . 

Given an m-by-n matrix A with n < m, we consider the computation of its QR 
factorization, which is the factorization A = QR, where Q is an m-by-n unitary 
matrix (Q H Q = I n ), and R is upper triangular. 

The QR factorization of an m-by-n matrix with n < m is the time consuming 
stage of some important numerical computations. It is needed for solving a linear 
least squares problem with m equations (observations) and n unknowns and is used 
to compute an orthonormal basis (the Q-factor) of the column span of the initial 
matrix A. For example, all block iterative methods (used to solve large sparse linear 
systems of equations or computing some relevant eigenvalues of such systems) require 
orthogonalizing a set of vectors at each step of the process. For these two usage 
examples, while n < m, n can range from n <C m to n = m. We note that the 
extreme case n = m is also relevant: the QR factorization of a matrix can be used to 
solve (square) linear systems of equations. While this requires twice as many flops as 
an LU factorization, using a QR factorization (a) is unconditionally stable (Gaussian 
elimination with partial pivoting or pairwise pivoting is not) and (b) avoids pivoting 
so it may well be faster in some cases. 

To obtain a QR factorization, we consider algorithms which apply a sequence of 
m-by-m unitary transformations, Ui, (U^Ui = I,), i = 1,...,£, on the left of the 
matrix A, such that after i transformations the resulting matrix R = Ut . . . U\A is 
upper triangular, in which case, R is indeed the i?-factor of the QR factorization. The 
Q-factor (if needed) can then be obtained by computing Q = U± . . . Uf . These types 
of algorithms are in regular use, e.g., in the LAPACK and ScaLAPACK libraries, 
and are favored over others algorithms (Cholesky QR or Gram-Schmidt) for their 
stability. 

17 



The unitary transformation [/, is chosen so as to introduce some zeros in the cur- 
rent update matrix C/j_i . . . U±A. The two basic transformations are Givens rotations 
and Householder reflections. One Givens rotation introduces one additional zero; 
the whole triangularization requires mn — n{n + l)/2 Givens rotations for n < m. 
One elementary Householder reflection simultaneously introduces m — i zeros in po- 
sition i + 1 to m in column i; the whole triangularization requires n Householder 
reflections for n < m. (See LAPACK subroutine GEQR2.) The LAPACK GEQRT 
subroutine constructs a compact WY representation to apply a sequence of % House- 
holder reflections, this enables one to introduce the appropriate zeros in % consecutive 
columns and thus leverage optimized Level 3 BLAS subroutines during the update. 
The blocking of Givens rotations is also possible but is more costly in terms of flops. 

The main interest of Givens rotations over Householder transformations is that 
one can concurrently introduce zeros using disjoint pairs of rows, in other words, two 
transformations Ui and U i+ i may be applicable concurrently. This is not possible 
using the original Householder reflection algorithm since the transformations work 
on whole columns and thus do not exhibit this type of intrinsic parallelism, forcing 
this kind of Householder reflections to be applied sequentially. The advantages of 
Householder reflections over Givens rotations are that, first, Householder reflections 
perform fewer flops, and second, the compact WY transformation enables high se- 
quential performance of the algorithm. In a multicore setting, where data locality 
and parallelism are crucial algorithmic characteristics for enabling performance, the 
tiled QR factorization algorithm combines both ideas: use of Householder reflections 
for high sequential performance and use of a scheme such as Givens rotations to en- 
able parallelism within cores. In essence, one can think either (i) of the tiled QR 
factorization as a Givens rotation scheme but on tiles {m b -by-n h submatrices) instead 
of on scalars (1-by-l submatrices) as in the original scheme, or (ii) of it as a blocked 
Householder reflection scheme where each reflection is confined to an extent much 
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less than the full column span, which enables concurrency with other reflections. 

Tiled QR factorization in the context of multicore architectures has been intro- 
duced in [HI [191 E] . Initially the focus was on square matrices and the sequence of 
unitary transformations presented was analogous to Sameh-Kuck [45], which cor- 
responds to reducing the panels with flat trees. The possibility of using any tree in 
order to either maximize parallelism or minimize communication is explained in [26] . 
The focus of this chapter is on maximizing parallelism. We reduce the communica- 
tion (data movement between memory hierarchy) within the algorithm to acceptable 
levels by tiling the operations. Stemming from the observation that a binary tree is 
best for tall and skinny matrices and a flat tree is best for square matrices, Hadri 
et al. [30], propose to use trees which combine flat trees at the bottom level with a 
binary tree at the top level in order to exhibit more parallelism. Our theoretical and 
experimental work explains that we can adapt Fibonacci [US] and Greedy [2H 125] 
to tiles, resulting in yet better algorithms in terms of parallelism. Moreover our new 
algorithms do not have any tuning parameter such as the domain size in the case 
of [30]. 

The sequential kernels of the Tiled QR factorization (executed on a core) are made 
of standard blocked algorithms such as LAPACK encoded in kernels; the development 
of these kernels is well understood. The focus of this chapter is on improving the 
overall degree of parallelism of the algorithm. Given a p-by-q tiled matrix, we seek 
to find an appropriate sequence of unitary transformations on the tiled matrix so as 
to maximize parallelism (minimize critical path length). We will get our inspiration 
from previous work from the 1970s/80s on Givens rotations where the question was 
somewhat related: given an m-by-n matrix, find an appropriate sequence of Givens 
rotations as to maximize parallelism. This question is essentially answered in [2H [251 
[MIES]; we call this class of algorithms " coarse-grain algorithms." 

Working with tiles instead of scalars, we introduce four essential differences be- 
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tween the analysis and the reality of the tiled algorithms versus the coarse-grain 
algorithms. First, while there are only two states for a scalar (nonzero or zero), a 
tile can be in three states (zero, triangle or full). Second, there are more operations 
available on tiles to introduce zeros; we have a total of three different tasks which can 
introduce zeros in a matrix. Third, in the tiled algorithms, the factorization and the 
update are dissociated to enable factorization stages to overlap with update stages; 
whereas, in the coarse-grain algorithm, the factorization and the associated update 
are considered as a single stage. Lastly, while coarse-grain algorithms have only one 
task, we end up with six different tasks: three from the factorizations (zeroing of 
tiles) and three for each of the associated updates (since these have been unlinked 
from the factorization). Each of these six tasks have different computational weights; 
this dramatically complicates the critical path analysis of the tiled algorithms. 

While the Greedy algorithm is optimal for coarse-grain algorithms, we show 
that it is not in the case of tiled algorithms. However, we have devised and proved 
that there does exist an optimal tiled algorithm. 

3.1 The QR factorization algorithm 

Tiled algorithms are expressed in terms of tile operations rather than elementary 

operations. Each tile is of size nj, x nb, where rif, is a parameter tuned to squeeze 
the most out of arithmetic units and memory hierarchy. Typically, n& ranges from 



80 to 200 on state-of-the-art machines |5j. Algorithm 3.1 outlines a naive tiled QR 
algorithm, where loop indices represent tiles: 

Algorithm 3.1: Generic QR algorithm for a tiled p x q matrix, 
l for k = 1 to min(p, q) do 



forall the i G {k + 1, . . . ,p} using any ordering, do 
^ elim(i,piv(i,k),k) 



In Algorithm 3.1, k is the panel index, and elim(i,piv(i,k),k) is an orthogonal 



transformation that combines rows i and piv (i, k) to zero out the tile in position (i, k). 
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However, this formulation is somewhat misleading, as there is much more freedom 
for QR factorization algorithms than, say, for Cholesky algorithms (and contrarily to 
LU elimination algorithms, there are no numerical stability issues). For instance in 
column 1, the algorithm must eliminate all tiles (z, 1) where % > 1, but it can do so 



in several ways. Take p = 6. Algorithm 34^ uses the transformations 

elim(2, 1, 1), elim(3, 1, 1), eHm(4, 1, 1), elim(5, 1, 1), elim(6, 1, 1) 

But the following scheme is also valid: 

elim(3, 1, 1), elim(6, 4, 1), elim(2, 1, 1), elim(5, 4, 1), elim(4, 1, 1) 

In this latter scheme, the first two transformations elim(3, 1, 1) and e/zm(6,4, 1) use 
distinct pairs of rows, and they can execute in parallel. On the contrary, elim(3, 1, 1) 
and elim(2,l,l) use the same pivot row and must be sequentialized. To compli- 
cate matters, it is possible to have two orthogonal transformations that execute in 
parallel but involve zeroing a tile in two different columns. For instance we can 
add e/im(6,5,2) to the previous transformations and run it concurrently with, say, 
elim(2, 1,1). Any tiled QR algorithm will be characterized by an elimination list, 
which provides the ordered list of the transformations used to zero out all the tiles 
below the diagonal. This elimination list must obey certain conditions so that the fac- 
torization is valid. For instance, elim(6, 5, 2) must follow elim(6, 4, 1) and elim(5, 4, 1) 
in the previous list, because there is a flow dependence between these transformations. 
Note that, although the elimination list is given as a totally ordered sequence, some 
transformations can execute in parallel, provided that they are not linked by a de- 
pendence: in the example, e/zm(6,4, 1) and elim(2, 1,1) could have been swapped, 
and the elimination list would still be valid. 

In order to describe more fully the dependencies inherent in the eliminations 



we shall observe a snippet of an example. In Figure 3.1a, to the left we have the 



row identifications, the empty circles represent zeroed elements, and the filled circles 
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represent the pivots used to zero out the elements. The first column's eliminations 
are shown in green and the second in red. From the elimination list, we define I Sy k as 
the set of rows in column k that are zeroed out at time step s. 



u 
□ 



(a) Diagram of elimina- 
tion list 




e/im(13,10,l) 
e/im(14,ll,l) 
e/im(15,12,l) 
elim(9, 5, 1) 
elim(10, 6, 1) 
elim(ll, 7, 1) 
eftm(12,8,l) 
e/im(15,14,2) 
efe'm(12,9,2) 
eKm(13, 10,2) 
eftm(14,ll,2) 



^2,2 =► 

(b) Elimination list 



elim(I Sul , 1) 

elim(I S2t i, 1) 

elim(I Su2 ,2) 
elim(I S2t2 ,2) 



What may not be so evident from the elimination list but is more apparent in 
the diagram of the elimination list are the following dependency relationships (note 
that -< indicates that the operation on the left must finish prior to the operation on 
the right starting): 



elim(piv(I Sj k, k), k — 1) -< elim(I S} k, k) 
elim(I s fi, k — 1) -< elim(I Sj k, k) 
elim(I s -i,k,k) -< elim(I Sj k,k) 



(3.1a) 
(3.1b) 
(3.1c) 



However, not all of these dependencies may cause an elimination to be locked to a 
particular time step. In fact, some dependencies may not be needed for a particular 
instance, but the addition of these will not create an artificial lock. For example, 
elim(I S2> 2, 2) is dependent upon elim(piv(I S2 ^), 1), e/zm(/ S2j2 , 1), and elim(l sit2 , 2) but 
elim(I su 2,2) only depends upon elim(piv(I slt2 ), 1) and elim(I sli 2, 1). 
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Table 3.1: Kernels for tiled QR. The unit of time is where is the blocksize. 





Panel 


Update 


Operation 


Name 


Cost 


Name Cost 


Factor square into triangle 


GEQRT 


4 


UNMQR 6 


Zero square with triangle on top 


TSQRT 


6 


TSMQR 12 


Zero triangle with triangle on top 


TTQRT 


2 


TTMQR 6 



Before formally stating the conditions that guarantee the validity of (the elimi- 
nation list of) an algorithm, we explain how orthogonal transformations can be im- 
plemented. 

3.1.1 Kernels 

To implement a given orthogonal transformation elim(i,piv(i, k), k), one can use 



six different kernels, whose costs are given in Table 3A In this table, the unit of time 
is the time to perform ^ floating-point operations. 

There are two main possibilities to implement an orthogonal transformation 
elim(i,piv(i, k), k): The first version eliminates tile (i, k) with the TS (Triangle on 



top of square) kernels, as shown in Algorithm 3.2 



Algorithm 3.2: Elimination elim(i,piv(i, k), k) via TS (Triangle on top of 
square) kernels. 



1 GEQRT (piv(i, k), k) 

2 TSQRT(i,piv(i, k), k) 

3 for j = k + 1 to q do 



UNMQR(piv(i,k),k,j) 
TSMQR(i,piv(i,k),k,j) 



Here the tile panel (piv(i, k), k) is factored into a triangle (with GEQRT). The 
transformation is applied to subsequent tiles (piv(i, k),j), j > k, in row piv(i, k) (with 
UNMQR). Tile (i, k) is zeroed out (with TSQRT), and subsequent tiles (i, j), j > k, 
in row i are updated (with TSMQR). The flop count is 4 + 6 + (6 + 12) (q — k) = 

23 



10 + 18(q — k) (expressed in same time unit as in Table 3.1). Dependencies are the 
following: 

GEQRT(piv(i, k), k) -< TSQRT(i,piv(i,k),k) 
GEQRT(piv(i, k), k) -< UNMQR(piv(i, k), k,j) for j > k 
UNMQR(piv(i,k),k,j) -< TSMQR{i,piv{i,k),k,j) for j > k 
TSQRT(i,piv(i, k), k) -< TSMQR(i,piv(i, k), k,j) for j > k 

TSQRT(i,piv(i, k), k) and UNMQR(piv(i, k), k,j) can be executed in parallel, as well 
as UNMQR operations on different columns j,j' > k. With an unbounded number 
of processors, the parallel time is thus 4 + 6 + 12 = 22 time-units. 

Algorithm 3.3: Elimination elim(i,piv(i, k), k) via TT (Triangle on top of 
triangle) kernels. 

1 if k > then 

2 | TTQRT(i,piv(i,k),k) 

3 if k < q then 
for j = k + 1 to g do 

5 if k > then 

o I TTMQR{i,piv{i,k),k,j) 



GEQRT(i, k + 1) 
for j = k + 2 to q do 

[ UNMQR(i, k,j) 



The second approach to implement the orthogonal transformation 
elim(i,piv(i, k), k) is with the TT (Triangle on top of triangle) kernels, as shown 



in Algorithm 3.3 Here tile (i, k) is zeroed out (with TTQRT) and subsequent tiles 
(z,j) and (piv(i, k), j), j > k, in rows i and piv(i,k) are updated (with TTMQR). 
Immediately following, tile (i, k + 1) is factored into a triangle and the correspond- 
ing transformations are applied to the remaining columns in row i. Necessarily, 
TTQRT must have the triangularization of tile (i, k) and (piv(i,k),k) completed 
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in order to proceed. Hence for the first column there are no updates to be ap- 
plied from previous columns such that the triangularization of these tiles (with 
GEQRT) is completed and can be considered a preprocessing step. The flop count is 
2(4 + 6(q — k)) + 2 + 6(q — k) = 10 + 18(g — k), just as before. Dependencies are the 
following: 



GEQRT (piv(i, k), k) -< UNMQR(piv(i,k),k,j) 


for j 


> k 


(3.2a) 


GEQRT (i, k) -< UNMQR(i, k,j) 


for j 


> k 


(3.2b) 


GEQRT(piv(i, k), k) -< TTQRT (i,piv(i, k), k) 






(3.2c) 


GEQRT (i, k) -< TTQRT(i,piv(i,k),k) 






(3.2d) 


TTQRT(i,piv{i,k),k) -< TTMQR(i,piv(i, k), k,j) 


for j 


> k 


(3.2e) 


UNMQR(piv(i,k),k,j) -< TTMQR(i,piv(i,k),k,j) 


for j 


> k 


(3.2f) 


UNMQR{i, k,j) -< TTMQR(i,piv(i,k),k,j) 


for j 


> k 


(3-2g) 



Now the factor operations in row piv(i, k) and i can be executed in parallel. Moreover, 
the UNMQR updates can be run in parallel with the TTQRT factorization. Thus, 
with an unbounded number of processors, the parallel time is 4+6+6 = 16 time-units. 

Recall our definition of the set I S) & to be the set of rows in column k that will be 
zeroed out at time step s in the coarse-grain algorithm. Thus the following depen- 



dencies are a direct consequence of |3.1c| as applied to the zeroing of a tile and the 
corresponding updates. 

TTQRT(I,_ 1 ,piv(I a - 1 ,k),k) -< TTQRT(I s ,piv(I s ,k),k) (3.3a) 
TTQRT '(Is-i,piv(I a -i, k), k,j) -< TTMQR(I s ,piv(I s , k), k,j) for j > k (3.3b) 



In Algorithm |3.2| and |3.3| it is understood that if a tile is already in triangular 

form, then the associated GEQRT and update kernels do not need to be applied. 

All the new algorithms introduced in this chapter are based on TT kernels. From 

an algorithmic perspective, TT kernels are more appealing than TS kernels, as they 

25 



offer more parallelism. More precisely, we can always break a TS kernel into two TT 
kernels: We can replace a TSQRT(i,piv(i, k), k) (following a GEQRT(piv(i, k), k)) by 
a GEQRT(i, k) and a TTQRT(i,piv(i,k),k). A similar transformation can be made 
for the updates. Hence a TS'-based tiled algorithm can always be executed with TT 
kernels, while the converse is not true. However, the TS kernels provide more data 



locality, they benefit from a very efficient implementation (see £3.3), and several ex- 
isting algorithms use these kernels. For all these reasons, and for comprehensiveness, 
our experiments will compare approaches based on both kernel types. 

At this point, the PLASMA library only contains TS kernels. We have mapped 
the PLASMA algorithm to TT kernel algorithm using this conversion. Going from a 
TS kernel algorithm to a TT kernel algorithm is implicitly done by Hadri et al. [TT] 
when going from their "Semi-Parallel" to their "Fully-Parallel" algorithms. 

3.1.2 Elimination lists 

As stated above, any algorithm factorizing a tiled matrix of size p x q is charac- 
terized by its elimination list. Obviously, the algorithm must zero out all tiles below 
the diagonal: for each tile (i,k), i > k, 1 < k < min(p, q), the list must contain 
exactly one entry elim{i,-k, k), where * denotes some row index piv(i, k) . There are 
two conditions for a transformation elim(i,piv(i, k), k) to be valid: 

• both rows i and piv(i, k) must be ready, meaning that all their tiles left 
of the panel (of indices (i, k') and (piv(i, k), k') for 1 < k' < k) must 
have already been zeroed out: all transformations elim(i,piv(i, k'), k') and 
elim(piv(i, k),piv(piv(i, k), k'), k') must precede elim(i,piv(i, k), k) in the elim- 
ination list 

• row piv(i, k) must be a potential annihilator, meaning that tile (piv(i, k), k) has 
not been zeroed out yet: the transformation elim(piv(i, k),piv(piv(i, k), k), k) 
must follow elim(i,piv(i, k), k) in the elimination list 
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Any algorithm that factorizes the tiled matrix obeying these conditions is called a 
generic tiled algorithm in the following. 

Theorem 3.1 No matter what elimination list (any combination of TT, TS) is used 
the total weight of the tasks for performing a tiled QR factorization algorithm is 
constant and equal to 6pq 2 — 2q 3 . 

Proof: We have that the quantity of each kernel is given by the following 

Li :: GEQRT = TTQRT + q 

L 2 :: UNMQR = TTMQR + (l/2)q(q - 1) 

L 3 :: TTQRT + TSQRT = pq - (l/2)q(q + 1) 

U ■■ TTMQR + TSMQR = (l/2)pq(q - 1) - (l/6)q(q -l)(q+ 1) 
The quantity of TTQRT provides the number of tiles zeroed out via a triangle on 
top of a triangle kernel. Thus equation L\ is composed of two parts: necessarily, 
the diagonal tiles must be triangularized and each TTQRT must admit one more 
triangularization in order to provide a pairing. The number of updates of these trian- 
gularizations, given by L 2 , is simply the sum of the updates from the triangularization 
of the tiles on the diagonal and the updates from the zeroed tiles via TTQRT . The 
combination of TTQRT and TSQRT, equation L 3 , is exactly the total number of 
tiles that are zeroed, namely every tile below the diagonal. Hence, the total number 
of updates, provided by L 4 , is the number of tiles below the diagonal beyond the first 
column minus the sum of the progression through the columns. Now we define 

L 5 = 4Li + 6L 2 + 6L 3 + 12L 4 

then 

L 5 = AGEQRT + QTSQRT + 2TTQRT + 6UNMQR + QTTMQR + 12TSMQR. 

As can be noted in L 5 , the coefficients of each term correspond precisely to the weight 

of the kernels as derived from the number of flops each kernel incurs. Simplifying L5, 
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we have our result 

L 5 = 6pq 2 - 2q 3 . 



A critical result of Theorem |3.1| is that no matter what elimination list is used, the 
total weight of the tasks for performing a tiled QR factorization algorithm is constant 
and by using our unit task weight of nf/3, with m = pnb, and n = qrib, we obtain 
2mn 2 — 2/3n 3 flops which is the exact same number as for a standard Householder 
reflection algorithm as found in LAPACK (e.g., |!4j). 

3.1.3 Execution schemes 

In essence, the execution of a generic tiled algorithm is fully determined by its 
elimination list. This list is statically given as input to the scheduler, and the exe- 
cution progresses dynamically, with the scheduler executing all required transforma- 
tions as soon as possible. More precisely, each transformation involves several kernels, 
whose execution starts as soon as they are ready, i.e., as soon as all dependencies have 
been enforced. Recall that a tile (i, k) can be zeroed out only after all tiles (i,k'), 
with k' < k, have been zeroed out. Execution progresses as follows: 

• Before being ready for elimination, tile (i, k),i > k, must be updated k— 1 times, 
in order to zero out the k — 1 tiles to its left (of index (i, k'), k' < k). The last 
update is a transformation TTMQR(i,piv(i, k — l),k — l,k) for some row index 
piv(i, k — 1) such that elim(i,piv(i, k — 1), k — 1) belongs to the elimination list. 
When completed, this transformation triggers the transformation GEQRT(i, k), 
which can be executed immediately after the completion of the TTMQR. In 
turn, GEQRT(i, k) triggers all updates UNMQR(i, k, j) for all j > k. These 
updates are executed as soon as they are ready for execution. 

• The elimination elim(i,piv(i, k), k) is performed as soon as possible when both 
rows i and piv(i,k) are ready. Just after the completion of GEQRT(i,k) and 
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GEQRT(piv(i, k),k), kernel TTQRT(i, piv(i, k), k) is launched. When finished, 
it triggers the updates TTMQR{i,piv{i, k), k,j) for all j > k. 

Obviously, the degree of parallelism that can be achieved depends upon the elim- 
inations that are chosen. For instance, if all eliminations in a given column use the 
same factor tile, they will be sequentialized. This corresponds to the flat tree elimi- 
nation scheme described below: in each column k, it uses elim(i, k, k) for all i > k. 
On the contrary, two eliminations elim(i,piv(i,k),k) and elim(i' ,piv(i' , k), k) in the 
same column can be fully parallelized provided that they involve four different rows. 
Finally, note that several eliminations can be initiated in different columns simulta- 
neously, provided that they involve different pairs of rows, and that all these rows are 
ready (i.e., they have the desired number of leftmost zeros). 

The following lemma will prove very useful; it states that we can assume w.l.o.g. 
that each tile is zeroed out by a tile above it, closer to the diagonal. 

Lemma 3.2 Any generic tiled algorithm can be modified, without changing its exe- 
cution time, so that all eliminations elim(i,piv(i, k), k) satisfy % > piv(i, k). 

Proof: Define a reverse elimination as an elimination elim(i,piv(i, k), k) where 
% < piv(i,k). Consider a generic tiled algorithm whose elimination list contains 
some reverse eliminations. Let ko be the first column to contain one of them. Let 
io be the largest row index involved in a reverse elimination in column ko- The 
elimination list in column ko may contain several reverse eliminations elim{i\,%o, ko), 
elim(i 2 , io, ko), . . . , elim{i r , i , ko), in that order, before row i is eventually zeroed out 
by the transformation elim[%o, piviio, ko), k ). Note that piv{io, ko) < io by definition 
of io- We modify the algorithm by exchanging the roles of rows io and %\ in column 
k : the elimination list now includes elim(i , i\, ko), elim(i 2 , i\, ko), . . . , elim{i r , i\, k ), 
and elim{ii, piviio, k ), k Q ). All dependencies are preserved, and the execution time is 
unchanged. Now the largest row index involved in a reverse elimination in column ko 
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is strictly smaller than i , and we repeat the procedure until there does not remain any 
reverse elimination in column fc . We proceed inductively to the following columns, 
until all reverse eliminations have been suppressed. ■ 

3.2 Critical paths 

In this section we describe several generic tiled algorithms, and we provide their 

critical paths, as well as optimality results. These algorithms are inspired by algo- 
rithms that have been introduced twenty to thirty years ago [I5J ESI ESI [23] , albeit for 
a much simpler, coarse-grain model. In this "old" model, the time-unit is the time 
needed to execute an orthogonal transformation across two matrix rows, regardless 
of the position of the zero to be created, hence regardless of the length of these rows. 
Although the granularity is much coarser in this model, any existing algorithm for 
the old model can be transformed into a generic tiled algorithm, just by enforcing 
the very same elimination list provided by the algorithm. Critical paths are obtained 
using a discrete event based simulator specially developed to this end, based on the 
Simgrid framework [47J. It carefully handles dependencies across tiles, and allows for 
the analysis of both static and dynamic algorithms^] 

3.2.1 Coarse-grain algorithms 

We start with a short description of three algorithms for the coarse-grain model. 



These algorithms are illustrated in Table 3.2 for a 15 x 6 matrix. 



3.2.1.1 Sameh-Kuck algorithm 

The Sameh-Kuck algorithm [32] uses the panel row for all eliminations in each 

column, starting from below the diagonal and proceeding downwards. Time-steps 

indicate the time-unit at which the elimination can be done, assuming unbounded 

resources. Formally, the elimination list is 

{(elim(i, k, k), % = k + 1, k + 2, . . . ,p) , k = 1, 2, . . . , min(p, q)} 

1 The dis crete event based simulator, together wit h the code for a ll tiled algorithms, is publicly 
available at http://graal.ens-lyon.fr/~mjacquel/tiledQR.litml 
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This algorithm is also referred as FlatTree. 

3.2.1.2 Fibonacci algorithm 

The Fibonacci algorithm is the Fibonacci scheme of order 1 in [36]. Let 

coarse(i, k) be the time-step at which tile (i,k), % > k, is zeroed out. These val- 
ues are computed as follows. In the first column, there are one 5, two 4's, three 3's, 
four 2's and four l's (we would have had five l's with p = 16). Given x as the least 
integer such that x(x + l)/2 > p — 1, we have coarse(i, 1) = x — y + 1 where y is the 
least integer such that i < y(y + l)/2 + 1. Let the row indices of the z tiles that are 
zeroed out at step s, 1 < s < x, range from % to i + z — 1. The elimination list for these 
tiles is elim(i + j , piv(i + j , 1), 1), with piv(i + j) = i+j — z for < j < z — 1. In other 
words, to eliminate a bunch of z consecutive tiles at the same time-step, the algorithm 
uses the z rows above them, pairing them in the natural order. Now the elimination 
scheme of the next column is the same as that of the previous column, shifted down 
by one row, and adding two time-units: coarse(i, k) = coarse(i — 1, k — 1) + 2, while 
the pairing obeys the same rule. 

3.2.1.3 Greedy algorithm 

At each step, the Greedy algorithm [2U [25] eliminates as many tiles as possible 

in each column, starting with bottom rows. The pairing for the eliminations is done 
exactly as for Fibonacci. There is no closed-form formula to compute coarse(i, k), 
the time-step at which tile (i, k) is eliminated, but it is possible to provide recursive 
expressions (see [2H EH] ) . 

Consider a rectangular p x q matrix, with p > q. With the coarse-grain model, 
the critical path of Sameh-Kuck is p + q - 2, and that of Fibonacci is x + 2q - 2, 
where x is the least integer such that x(x + l)/2 > p — 1. The critical path of 
Greedy is unknown, but the critical path of Greedy is optimal. For square q x q 
matrices, critical paths are slightly different (2q — 3 for Sameh-Kuck, x + 2q — 4 for 
Fibonacci). 
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Table 3.2: Time-steps for coarse-grain algorithms. 
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3.2.2 Tiled algorithms 

As stated above, each coarse-grain algorithm can be transformed into a tiled 
algorithm, simply by keeping the same elimination list, and triggering the execution 
of each kernel as soon as possible. However, because the weights of the factor and 
update kernels are not the same, it is much more difficult to compute the critical 



paths of the transformed (tiled) algorithms. Table 3.3 is the counterpart of Table 3.2 
and depicts the time-steps at which tiles are actually zeroed out. Note that the tiled 
version of Sameh-Kuck is indeed the FlatTree algorithm in PLASMA [IEJQJ2, and 



we have renamed it accordingly. As an example, Algorithm |3.4| shows the Greedy 
algorithm for the tiled model. 

A first (and quite unexpected) result is that Greedy is no longer optimal, as 



shown in the first two columns of Table 3.3a for a 15 x 2 matrix. In each column and 



at each step, "the Asap algorithm" starts the elimination of a tile as soon as there are 

at least two rows ready for the transformation. When s > 2 eliminations can start 

simultaneously, Asap pairs the 2s rows just as Fibonacci and Greedy, the first row 

(closest to the diagonal) with row s+1, the second row with row s + 2, and so on. As a 
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matter of a fact, when processing the second column, both Asap and Greedy begin 
with the elimination of lines 10 to 15 (at time step 20). However, once tiles (13,2), 
(14,2) and (15,2) are zeroed out (i.e. at time step 22), Asap eliminates 4 zeros, in 
rows 9 through 12. On the contrary, Greedy waits until time step 26 to eliminate 6 
zeros in rows 6 through 12. In a sense, Asap is the counterpart of Greedy at the 



tile level. However, Asap is not optimal either, as shown in Table 3.3a for a 15 x 3 
matrix. On larger examples, the critical path of Greedy is better than that of Asap, 
as shown in Table I3.3bl 

We can however use the optimality of the coarse-grain Greedy to devise an 
optimal tiled algorithm. Let us define the following algorithm: 

Definition 3.3 Given a matrix ofpxq tiles, withp > q, the GrASAP (i) algorithm 



1. uses Algorithm 3.3 to execute Greedy on the first q — i columns and propagate 



the updates through column q. 
2. and for column (s) q — i + 1 through q, apply the ASAP algorithm. 

Clearly, if we let i = q we obtain the Asap algorithm. We define GrASAP to be 
GrASAP (1), i.e., only the elimination of the last column will differ from Greedy, 
and we will show that GrASAP is an optimal tiled algorithm. 

Although we cannot provide an elimination list for the entire tiled matrix of 
size p x q, we do provide an elimination list for the first q — 1 columns. This tiled 



elimination list describes the time-steps at which Algorithm 3.3 is complete, i.e., all of 
the factorization kernels are complete for k < q — 1 and corresponding update kernels 
are complete for all columns k < j < q. 

We must make note of one consequence from the coarse-grain elimination list 
before proceeding. We will use this repeatedly within the proof of translating a 
coarse-grain elimination list to a tiled elimination list. 
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Lemma 3.4 Given an elimination list from any coarse-grain algorithm, let 
s = coarse(i, k) be the time step at which element (i, k) is eliminated and let 

I s ,k = ON = coarse(I S)k , k)} . 



Then for any s we have 

s — 1 = coarse(I s ^, k) — 1 > max 
and in particular 



\ 



s\ — 1 = max 



coarse{I St ki k — 1) 
^coarse(piv(I Stk , k), k - 1) 



coarse(I Sli k, k — 1) 
^coarse(piv(I Slt k, k), k — 1) 



where s± = min fc+1 <j< p (coarse(i, fc)). 



Proof: This follows directly from the dependencies given in (3.1a)-(3.1c) 



Theorem 3.5 Given the elimination list of a coarse-grain algorithm for a matrix of 



size px q, using Algorithm 3.3 , the tiled elimination list for all but the last column is 
given by 

tiled(i, k) = 10k + 6 • coarse(i, k), l<i<p,l<k<q— 1 

where coarse(i, k) is the elimination list of the coarse-grain algorithm. 

Proof: In this analysis, when k is clear, we will use I s instead of I Sj k- By abuse of no- 
tation, we will write GEQRT(i, k) to denote the time at which the task GEQRT(i, k) 
is complete and this will be the same for all of the kernels. Thus we will prove that 

tiled(i,k) = TTMQR(i,piv(i, k), k, j) for j > k. 

Note that j represents the column in which the updates are applied and all columns 

j for j > k have the same update history. In Algorithm |3.3[ the two j-loops spawn 
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mutually independent tasks. Since we have an unbounded number of processors, these 
tasks can all run simultaneously. So j represents any one of these columns. 

We will proceed by induction on k. For the first column, k = 1, we do not have 



any dependencies which concern the GEQRT operations. Thus from (3.2b) we have 
for 1 < % < p, 



GEQRT(i, 1) = 4 
UNMQR(i, 1, j) = 4 + 6 = 10 



(3.4) 
(3.5) 



Since each column in the coarse-grain elimination list is composed of one or more 
time steps, we must also proceed with induction on the time steps. Let 



Si = min (coarse(i, 1)) 

2<i<p 



(3.6) 



In the case k — 1, we have that 



Sl = l. 



In other words, the first tasks finish at time step 1 for the coarse-grain algorithm. 
This is a complicated manner in which to state that s± = 1, but it will be needed in 
the general setting. 



So for si, from (3.2c) and (3. 2d) we have 



TTQRT(I sl ,piv(I sl ,l), 1) = max 



' GEQRT(piv(I Sl ,l),iy 



\ 



GEQRT(I S1 ,1) 



+ 2 



4 + 2 



thus 



TTQRT{I Sl ,piv{I Sl ,l),l) =4 + 2 Sl 



(3.7) 
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TTMQR(I sl ,piv(I sl , 1), 1, j) = max 



Now from (3.2e), (3.2f), and (3.2g) we have 

'TTQRT(I Sl ,piv(I Sl1 l),l)\ 

UNMQR(piv(I tl ,l),l,3) 
\ UNMQR(I Sl ,l,j) J 

10 • 1 + 6si 

10 ■ 1 + 6 ■ coarse(I Sl , 1) 



Therefore, 



TTMQR (I S1 ,piv(I ai ,l),l,j) = tiled{I sl , 1) . 



Assume that for l<t<s — lwe have 



(3. 



TTQRT(I t ,piv(I t , 1), 1) = 4 + 2t 
TTMQR(I t ,piv(I t , 1), 1, j) = 10 + 6t 



then from (3.2c), (3. 2d), and (3.3a) we have 

/ 



TTQRT(I s ,piv(I s , 1), 1) = max 



GEQRT (piv(I s , 1), 1) 
G£Qi2T (7 S , 1) 



(3.9) 
(3.10) 



+ 2 



^ TTQRT (7 s _i,pzu(/ s _i, 1), 1) y 



max 



4 
4 



^4 + 2(5-1) J 



+ 2 



Thus 



TTQRT(I s ,piv(I s , 1), 1) = 4 + 2s. 



(3.11) 



3(3 



From fl3~2e) , ( pOfft , ggg ), and fl3~3bl ) we have 

/ 



TTMQR(I s ,piv(I s , 1), 1, j) = max 



\ 



max 



+ 6 



Thus 



TTQRT(I s ,piv(I s ,l),l 

UNMQR(piv(I s ,l),lJ 
UNMQR (I 8 ,1J) 
\ TTMQR (I^pivil^, 1), l,j)J 

( 4 + 2s ^ 
10 
10 

\10 + 6(s-l) J 
= 10 + 6(s- 1) + 6 
= 10 + 6s 

= 10 • 1 + 6 • coarse(I s , 1) 
TTMQR(I s ,piv(I s , 1), l,j) = tiled(I s , 1) 



+ 6 



(3.12) 



establishing our base case for the induction on k. 

Now assume that for 1 < h < k — 1 we have, for any s in column h, 

TTMQR(I s ,piv(I s , h), h,j) = tiled{I s , h). 

In order to start the elimination of the next column, we must have that all updates 
from the elimination of the previous column are complete. Thus using the induction 
assumption, we have 

GEQRT(i, k) = TTMQR(i,piv(i,k - l),k - l,k) + 4 



so that 



GEQRT(i, k) = 10(k - 1) + 6 ■ coarse(i, k - 1) + 4 



(3.13) 
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and 



UNMQR(i,k,j) = max 



/ 



GEQRT(i, k) 
TTMQR(i,piv(i, k - 1), k - 1, j 



\ 



+ 6 



so that 



UNMQR(i, k,j) = 10k + 6 • coarse(i, k-1). 



(3.14) 



Again, we must proceed with an induction on the time steps in column k. Let 



Si = min (coarse(i, k)) 
k+i<i< P 



(3.15) 



From (3.2c) and (3.2d) we have 



TTQRT(I Sl ,piv(I Sl ,k),k) = max 



' GEQRT(piv(I n ,k),ky 



\ 



GEQRT (I S1 , k) 



+ 2 



/ 



max 



10(fc — 1) + 6 • coarse(piv(I Sl , k), k — 1) + 4 
^ W(k - 1) + 6 • coarse{I sl , k - 1) + 4 



+ 2 



10(Jfe- 1) + 6 max 



^ coarse(piv(I sl , k),k — 1) 
^ coarse(I sl , k — 1) ^ 



+ 4 + 2 



From the application of Lemma |3.4[ we have 

/ 

coarse(I Sl , k) — 1 = max 



coarse{I Sl ^ } k — 1) 
coarse(piv(I Sl! k, k), k — 1) 



such that 



TTQRT(I sl ,piv(I sl ,k),k) = 10{k - 1) + 6[coarse(I sl ,k) - 1] + 4 + 2. 



Therefore, 

TTQRT(I Sl ,piv(I Sl ,k),k) = 10(k - 1) + 6 Sl . 



(3.16) 
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For the updates, we must again examine the three dependencies which result from 



(3.2e), (3.2f), and (3.2g) such that we have 

/ 



TTMQR(I Sl ,piv(I Sl , k), k, j) = max 



\ 



+ 6 



max 



UNMQR(piv(I Sl ,k),k } ] 
UNMQR(I Sl ,k,j) 
\TTQRT (I Sl ,ptv(I Sl ,k),k) j 

10k + 6 ■ coarse(I Sl , k — 1) 
10A; + 6 ■ coarse(piv(I Sl , k), k — 1 
^10(A;- 1) +6si 



/ 



Using Lemma 3.4, we have 



TTMQR(I Sl ,piv(I Sl ,k),k,j) = max 



^10A; + 6(si - 1^ 



V 



10(fc-l) + 6si 
10k + max 



OSi — o 



6si - 10 



+ 6 



+ 6 



lOfc + 6si 



Therefore 



TTMQR{I sl ,piv{I sl ,k),k,j) = Uled(I Sl ,k). 



(3.17) 



Now assume that for s-i < t < s — 1 we have 



TTQRT(I t ,piv(I t , k), k) < 10(Jfe - 1) + 6t 
TTMQR(I t ,piv(I t ,k),k,j) = 10k + 6t. 



(3.18) 
(3.19) 
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and note that we do not have equality for s > s% via Lemma 3.4 From (3.2c), (3. 2d) 



and (3.3a) we have 



TTQRT(I s ,piv(I s , k),k) = max 



< max 



^ GEQRT(piv(I s ,k),k) ^ 

GEQRT (I„ k) 
y TTQRT (Is-^pivils-!, k), k) J 

'I0(k - 1) + 6 • coarse(piv(I s ,k),k - 1) +4^ 
10(fc - 1) + 6 • coarse{I s , k - 1) + 4 
10(k- 1) + 6(s- 1) 



+ 2 



Note that from Lemma EI 



s — 1 > max 



coarse(piv(I s , k),k — 1) 
coarse(I s , k — 1) 



such that 



Thus 



TTQRT (I s ,piv(I s , k), k) < 10(fc - 1) + 6(s - 1) + 4 + 2. 



TTQRT (I s ,piv(I s , k), k) < 10(fc - 1) + 6s. 



(3.20) 



For the updates, we must examine the four dependencies which result from (3.2e) 



(3.2f), (3.2g), and (3.3b) such that we have 

/ 



TTMQR(I s ,piv(I s ,k),k,j) = max 



max 



UNMQR(piv(I S} k) } k,]) 
UNMQR(I s ,k,j) 
TTQRT (I s ,piv(I s , k), k) 
\TTMQR(I s ^,piv(I s ^,k),k,j) J 

10k + 6 • coarse(I s , k — 1) 
10A; + 6 • coarse(piv(I s , k), k — 1) 
10(k - 1) + 6s 
yi0fc + 6(s-l) y 



+ 6 



+ 6. 
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As before, Lemma 13.41 allows us to write 



TTMQR(I s ,piv(I s , k), k,j) = max 



10k + max 



lOfc + 6s 



^10£; + 6(s- 1) 
^10(fc- 1) + 6s^ 

OS — 6 



6s - 10 



+ 6 



Therefore 



TTMQR(I s ,piv(I s ,k),k,j) = tiled{i,k). 



(3.21) 



Corollary 3.6 Given an elimination list for a coarse-grain algorithm on a matrix of 
size p x q where p > q, the critical path length of the corresponding tiled algorithm is 
bounded by 

tiled(p, g-l)+4 + 2< CP(p, q) < tiled(p, q). 

Proof: For any tiled matrix, the last column will necessarily need to be factorized 
which explains the addition of four time steps and since p > q at least one TTQRT will 
be present which accounts for the two time steps thereby establishing the lower bound. 
By including one more column, the upper bound not only includes the factorization 
of column q, but also the respective updates onto column q + 1 such that the critical 
path of the p x q tiled matrix must be smaller. ■ 

Corollary 3.7 Given an elimination list for a coarse-grain algorithm on a matrix of 
size p x q where p = q, the critical path length of the corresponding tiled algorithm is 



CP(p, q) = tiled(p, q - 1) + 4. 

Proof: In the last column, we need only to factorize the diagonal tile which explains 

the additional four time steps. Moreover, there are no further columns to apply any 
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updates to nor any tiles below the diagonal that need to be eliminated. Thus the 
result is obtained. ■ 
In the remainder of this chapter, we will make use of diagrams to clarify certain 
aspects of the proofs and provide examples to further illustrate the points being made. 



These diagrams make use of the kernel representations as shown in Figure 3.1 



kernel weight kernel weight 



GEQRT m ♦ 3 4 UNMQR H~ ♦— H 6 

TTQRT 2 TTMQR 6 



Figure 3.1: Icon representations of the kernels 

We have a closed-form expression for the critical path of tiled FlatTree for all 
three cases: single tiled column, square tiled matrix, and rectangular tiled matrix of 
more than one column. 

Proposition 3.8 Consider a tiled matrix of size px q, where p > q > 1. The critical 
path length of FlatTree is 



CP ft {p,q) = { 



2p + 2, tfq = l; 

22p-24, ifp = q>l; 
Qp+lQq -22,ifp>q>l. 

Proof: Consider first the case q = 1. We shall proceed by induction on p to 
show that the critical path of FlatTree is of length 2p + 2. If p — 1, then from 



Table 3.1 the result is obtained since only GEQRT {1,1) is required. With the base 
case established, now assume that this holds for all p — 1 > q — 1. Thus at time 
t = 2{p — 1) + 2 = 2p, we have that for all p — 1 > i > 1 tile {i, 1) has been factorized 
into a triangle and for slip — 1 > i > 1, tile (i, 1) has been zeroed out. Therefore, tile 
(p, 1) will be zeroed out with TTQRT {p, 1) at time t + 2 = 2{p - 1) + 2 + 2 = 2p + 2. 
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Considering the second case p = q > 1, we will be using Figure 3J2 to illustrate. 
We initialize with a triangularization of the first column and send the update to the 
remaining column(s), 10 time units. The we fill the pipeline with the updates onto 
the remaining column(s) from the zeroing operations of the first column, 6(p — l) time 
units. Then for each column after the first, except the last one, we fill the pipeline 
with the triangularization, update of triangularization, and update of zeroing for the 
bottom most tile, (4+6+6) (p— 2) time units. In the last column, we then triangularize 
the bottom most tile, 4 time units. Thus 

10 + 6(p - 1) + (4 + 6 + 6)(g - 2) + 4 = Op + lOq - 24 = 22p - 24 

The third case is analogous to the second case but we still need to zero out the 
bottom most tile in the last column which explains the difference of 2 in the formula 
from the square case. ■ 

time 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 

U-^-?v- fr + -n n - rs n- n n 
U — <y~?>-6 * - o i f 



m-o~T>-6 



3-a 




^-6 — a-S — «»-f~t 

3'"' Column 1 1 



Fill the pipeline: 6(p- 1) 



Pipeline: 16(g - 2) 



Figure 3.2: Critical Path length for the weighted FlatTree on a matrix of 4 x 4 
tiles. 



We remind that for the coarse algorithm, 



coarse(p, q) 



0, if g < 1 ; 

0, iSp = q = l; 

p + q — 2, if p > g > 1; 
2p — 3, if p = q > 1. 
So we find that considering a tiled matrix of size p x q, where p > q > 1. The critical 
path length of FlatTree is given as 

CP(p, q) = 10 (q — 1) + 6 (coarse(p, q — 1)) + 4 + 2 (coarse(p, q) — coarse(p, q — 1)) . 
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Proposition 3.9 The critical path length of Fibonacci is greater than 22q — 30 and 
less than 22q + 6 [-y/2p] . 

Proof: The critical path length of the coarse-grain Fibonacci algorithm for a 
p x q matrix is 

coarse(p, q) = x + 2q — 2. 



Thus from Proposition 3.6 we have 



10(q - 1) + 6(z + 2{q - 1) - 2) + 4 < CP(p, q) < 10g + 6(x + 2q - 2). 

Recall that x is the least integer such that 

x(x + l) > 
2 ~ ^ 

whereby 



Thus a; < and therefore 

22g - 30 < CP(j9, g) < 22g - 12 + 6 



Similarly to [25] in which an iterate of a column is defined for the coarse-grain 
algorithms, we define a weighted iterate and our notation will follow in the same 
manner. 

A column of length n is a sequence of n integers: 

a = a™ 1 • • • a n q q 

where power means concatenation with the following restrictions: 

di > a i+ i > di, 1 < i < q — 1; 

n>i> 0, 1 < i < q; ni + ■ ■ ■ + n q = n. 
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We define on the set of columns of length n the classical partial ordering of M. n : 



and the s-truncate (1 < s < n) of a is a column of length s composed of the s first 
elements of a and is denoted a s . 

Definition 3.10 Given a task weight of w and column a = a™ 1 ■ ■ ■ a\ q , the column 
c = c™ 1 ■ ■ ■ c™ p is called an iterate of a, or c = iter (a), if 

(i) c is a column of length n — 1 

(U) CLi + W < C\ 

(a) if ai + w < ci < a 2 then mi < \n\/2\ 

(b) if there exists an h such that a^-i + w < Ch < then 

m h < [(ni H h n fc _i - mi /2J 

/or 2 < k < q and 1 < h < p with m = 0. 
^ e/se aj < and 



where j = min(p + l,q). 

Definition 3.11 Given a task weight of w, let a = a™ 1 • • • aq q be a column iterate of 
length n then the sequence b = b™ 1 ■ ■ ■ b™ p , or b = opiter(a), is defined as 

(i) for bi and m\ 

(a) if ni = 1, then bi = a 2 + w and m 1 = [(rii + n 2 ) /2j. 

(b) if n\ > 1, then bi = a\ + w and mi = [(ni) /2J . 




m h < L( n i + • • • + Uj — mi 



mh-i) /2J 
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(ii) if there exists k such that a k _i + w < 6j_i < a k , then 

U-i =n\-\ V n k „i - mi m^ x > I. 

(a) if bi-i < a k and ri-i > 1, then bi = bi-i +w, and rrii = [n-i/2j. 

(b) else bi = a k + w, rrii = [(n k + r i _ 1 )/2\ . 

(Hi) ifbi-i > Cbj where j = min(i, q), then bi = bi-i + w, and 

mi < L( n i H \- rij — mi — • • • — rrii-i) /2J . 

Proposition 3.12 Given an iterated column a of length n, the sequence 

b = b™ 1 ---b™ p , 
or b = optiter(a) is an iterate of a. 

Proof: The proof follows directly from the definition. ■ 
Proposition 3.13 

(i) Let a n be a column of length n and c n _i = iter(a n ) an iterated column of a n . 
Then 

6 n _i = optiter(a n ) < iter(a n ) = c„_i. 

(ii) Let a n and c n be two columns of length n such that a n < c n . Then 

optiter(a n ) < optiter(c n ). 

Proof: 

(i) Clearly, b\ < c\ by definition since b\ is chosen to be as small as possible. 
Moreover, by definition bi < Cj for i < j since (i) if < a k and r^-i > 1, 
meaning there are enough elements available to perform a pairing, then bi is 
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again chosen as small as possible, (ii) otherwise 6j = a^ + w which is the smallest 
again, (iii) else > dj and 6, is chosen as the next smallest element. Thus 

im 1 +-+m i < m 1 +-+m i i <-,•<-„ 

so that b n -i < Cn-i. 

(ii) This is another direct application of the definition and follows along the same 
argument. 

■ 

Clearly, letting w — 1 gives the definitions of iter and optiter of the coarse-grain 



algorithms as presented in [25]. Definition 3.11 is the Asap algorithm on a single tiled 



column and can be viewed as the counter part of the coarse-grain Greedy algorithm 
in the tiled case and follows a bottom to top elimination of the tiles. In order to 
preserve the bottom to top elimination, the weight of the updates must be an integer 
multiple of the iterated column weight. 

Theorem 3.14 Given a matrix of p x q tiles, a factorization kernel weight 0/7, an 
elimination kernel weight of a, and an update kernel weight of (3 = na for some 
n G N, the GrASAP algorithm is optimal in the context of the class of algorithms 
that progress left to right, bottom to top. 



Proof: From Theorem 3.5 we have a direct translation from any coarse-grain al- 
gorithm in this class to the tiled algorithm for the first q — 1 columns. Thus we are 
given the time steps at which rows in column q are available for elimination. Now 
we fix the time steps for the elimination of the last column and follow whatever tree 
the algorithm provides for this last column. We can replace the elimination of the 
first q — 1 columns and updates from these eliminations onto the remaining columns 
with the tiled Greedy algorithm. This is possible since the translation function is 

monotonically increasing and we know that Greedy is optimal for the coarse-grain 
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algorithms and therefore optimal for the first q — 1 columns in the tiled algorithms. In 
another manner of speaking, we slow down the eliminations and updates on the first 
q — 1 columns when not using the tiled GREEDY algorithm. (An illustrative example 



is shown in Figure 3.3b 



Let c be the next to last column of the coarse-grain elimination table which is of 
length p — (q — 1) + 1. Now letting 

a = (7 + p)(q — 1) + p ■ coarse(p, q — 1) + 7 

provides an iterated column of length p — (q — 1) + 1 for the tiled algorithm. With 
w = a we have that b = optiter(a) is an optimal iterated column of length p — q + 1 
with the elimination progressing from bottom to top. This can be applied to any tiled 
algorithm in this class since we only concern ourselves with the time steps at which the 
last column's elements are available for elimination. In other words, this is a speeding 
up of the elimination of the last column while adhering to any restrictions incurred 



from the previous columns. (An illustrative example is shown in Figure 3.3c ) 

Combining these two ideas, let Greedy be performed on the first q — 1 columns 
and then ASAP on the remaining column q. This provides an optimal algorithm in 
this class of algorithms. ■ 



In Figure 3.4 we provide an illustrated example of the Greedy and GrASAP 



algorithms on a matrix of 15 x 2 tiles where the operations are given by Figure 3.1 

It can be seen that GrASAP finishes before Greedy since Greedy must wait 
and progress with the same elimination scheme as the coarse-grain algorithm while 
GrASAP can begin eliminating in the last column as soon as a pair of tiles becomes 
available. (The elimination of the first column is shown in light gray.) 

We have analyzed the critical path length of GrASAP versus that of Greedy 



for tiled matrices p x q where 1 < p < 100 and 1 < q < p (see Figure 3.5). In all 
cases where there is a difference (which is just over 44% of the cases), the difference 
is always two time steps. 
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(a) Fibonacci 



(b) Greedy on first q — 1 columns 




(c) Asap on column q 



Figure 3.3: Illustration of first and second parts of the proof of Theorem 3.14 
the Fibonacci algorithm on a matrix of 15 x 2 tiles. 



using 



We now show that without having the update kernel weight an integer multiple 
of the elimination kernel weight, the bottom to top progression is nullified and we 
cannot provide optimality of the algorithm. 

Assume that the update kernel weight is 3 and the elimination kernel weight is 
2. Let an = 3 7 6 4 be column from some elimination scheme. We shall apply three 
iterated schemes to this column: (1) an Asap scheme that progresses from bottom to 
top, (2) an Asap scheme that can progress in any manner, and (3) an Asap scheme 
which may provide a lag. 



In Table 3.5 we clearly see that elimination scheme (3) provides the best time for 
the algorithm. The reason is that enough of a lag was provided such that a binomial 
tree could progress without hindrance. Therefore without integer multiple weights on 
the update kernel, we cannot know which scheme will be optimal. 
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(a) Greedy (b) GrASAP 

Figure 3.4: Greedy versus GrASAP on matrix of 15 x 2 tiles. 

Matrix size for which the critical path of GrASAP is shorter than Greedy 

5 - 
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q 

Figure 3.5: Tiled matrices of p x q where the critical path length of GrASAP is 
shorter than that of GREEDY for 1 < p < 100 and 1 < q < p. 

The PLASMA library provides more algorithms, that can be informally described 
as trade-offs between FlatTree and BinaryTree. (We remind that FlatTree 
is the same as algorithm as Sameh-Kuck.) These algorithms are referred to as 
PlasmaTree in all the following, and differ by the value of an input parameter 
called the domain size BS. This domain size can be any value between 1 and p, 
inclusive. Within a domain, that includes BS consecutive rows, the algorithm works 
just as FlatTree: the first row of each domain acts as a local panel and is used to 
zero out the tiles in all the other rows of the domain. Then the domains are merged: 
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the panel rows are zeroed out by a binary tree reduction, just as in BinaryTree. 
As the algorithm progresses through the columns, the domain on the very bottom is 
reduced accordingly, until such time that there is one less domain. For the case that 
BS = 1, PlasmaTree follows a binary tree on the entire column, and for BS = p, 
the algorithm executes a flat tree on the entire column. It seems very difficult for a 
user to select the domain size BS leading to best performance, but it is known that 



BS should increase as q increases. Table 3.3 shows the time-steps of PlasmaTree 



with a domain size of BS = 5. In the experiments of £3^3 we use all possible values 
of BS and retain the one leading to the best value. 

3.3 Experimental results 

All experiments were performed on a 48-core machine composed of eight hexa- 
core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz. 
Each core has a theoretical peak of 11.2 Gflop/s with a peak of 537.6 Gflop/s for 
the whole machine. The Istanbul micro-architecture is a NUMA architecture where 
each socket has 6 MB of level-3 cache and each processor has a 512 KB level-2 cache 
and a 128 KB level-1 cache. After having benchmarked the AMD ACML and Intel 
MKL BLAS libraries, we selected MKL (10.2) since it appeared to be slightly faster 
in our experimental context. Linux 2.6.32 and Intel Compilers 11.1 were also used in 
conjunction with PLASMA 2.3.1. 

For all results, we show both double and double complex precision, using all 48 
cores of the machine. The matrices are of size m = 8000 and 200 < n < 8000. The 
tile size is kept constant at rib = 200, so that the matrices can also be viewed as 
p x q tiled matrices where p = 40 and 1 < q < 40. All kernels use an inner blocking 
parameter of ib = 32. 

In double precision, an FMA ("fused multiply- add" , y <— ax + y) involves three 

double precision numbers for two flops, but these two flops can be combined into one 

FMA and thus completed in one cycle. In double complex precision, the operation 
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Best domain size for PLASMATREE (TT) = [1 5 5 5 17 28 8] 



FlatTree (TT) 
PlasmaTree (TT) (best) 
Fibonacci (TT) 



2 3 4 5 6 7 8 9 10 



20 30 40 



(a) Upper bound (double complex) 



Best domain size for PlasmaTree (TT) = [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 




FlatTree (TT) 
PlasmaTree (TT) (best) 
Fibonacci (TT) 
Greedy 



2 3 4 567891 



20 30 40 



(c) Upper bound (double) 




FlatTree (TT) 
PlasmaTree (TT) (best) 
Fibonacci (TT) 



2 3 4 5 6 7 8 9 10 



20 30 40 



(b) Experimental (double complex) 



Best domain size for PLASMATREE (TT) = [1 3 10 5 17 27 19] 




FlatTree (TT) 
PlasmaTree (TT) (best) 
Fibonacci (TT) 
Grei 



2 3 4 56789 10 



(d) Experimental (double) 



20 30 40 



Figure 3.6: Upper bound and experimental performance of QR factorization - TT 
kernels 



y <— ax + y involves six double precision numbers for eight flops; there is no FM A. 
The ratio of computation/communication is therefore, potentially, four times higher in 
double complex precision than in double precision. Communication aware algorithms 
are much more critical in double precision than in double complex precision. 

For each experiment, we provide a comparison of the theoretical performance to 
the actual performance. The theoretical performance is obtained by modeling the 
limiting factor of the execution time as either the critical path, or the sequential time 
divided by the number of processors. This is similar in approach to the Roofline 
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Best domain size for PlasmaTree (TT) = [ 1 3 5 5 5 10 10 1(1 10 10 21) . . . 20 ] 



FlatTree (TT) 
PlasmaTree (TT) (best) 
Fibonacci (TT) 
Greedy 




2 3 4 5 6 7 8 9 10 20 30 40 



(a) Theoretical CP length 



Best domain size for PLASMATREE (TT) = [1 5 5 5 17 28 8] 



Best domain size for PLASMATREE (TT) = [1 3 10 5 17 27 19] 



FlatTree (TT) 
PlasmaTree (TT) (best) 
Fibonacci (TT) 
Greedy 




FlatTree (TT) 
PlasmaTree (TT) (best) 
Fibonacci (TT) 




1 2 3 4 5 6 7 8 9 10 20 30 40 



(b) Experimental (double complex) 



1 2 3 4 5 7 8 9 10 20 30 40 



(c) Experimental (double) 



Figure 3.7: Overhead in terms of critical path length and time with respect to 
Greedy (Greedy = 1) 

model [53] . Taking ^ seq as the sequential performance, T as the total number of flops, 
cp as the length of the critical path, and P as the number of processors, the upper 
bound on performance, 7 U &, is 

Iseq ■ T 



Jub 



max 



(J,cp) 



Figures 3.6a and 3.6c depict the upper bound on performance of all algorithms which 
use the Triangle on top of triangle kernels. Since PlasmaTree provides an addi- 
tional tuning parameter of the domain size, we show the results for each value of this 
parameter as well as the composition of the best of these domain sizes. Again, it 
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is not evident what the domain size should be for the best performance, hence our 
exhaustive search. 

Part of our comprehensive study also involved comparisons made to the Semi- 
Parallel Tile and Fully-Parallel Tile CAQR algorithms found in [11 J which are much 
the same as those found in PLASMA. As with PLASMA, the tuning parameter 
BS controls the domain size upon which a flat tree is used to zero out tiles below 
the root tile within the domain and a binary tree is used to merge these domains. 
Unlike PLASMA, it is not the bottom domain whose size decreases as the algorithm 
progresses through the columns, but instead is the top domain. In this study, we found 
that the PLASMA algorithms performed identically or better than these algorithms 
and therefore we do not report these comparisons. 



Figure 3.6b and 3.6d illustrate the experimental performance reached by Greedy, 
Fibonacci and PlasmaTree algorithms using the TT (Triangle on top of trian- 
gle) kernels. In both cases, double or double complex precision, the performance 
of Greedy is better than PlasmaTree even for the best choice of domain size. 



Moreover, as expected from the analysis in §3.2.2[ Greedy outperforms Fibonacci 
the majority of the time. Furthermore, we see that, for rectangular matrices, the 
experimental performance in double complex precision matches the upper bound on 
performance. This is not the case for double precision because communications have 
higher impact on performance. 

While it is apparent that Greedy does achieve higher levels of performance, the 
percentage may not be as obvious. To that end, taking Greedy as the baseline, we 



present in Figure [37f| the theoretical, double, and double complex precision overhead 
for each algorithm that uses the Triangle on top of triangle kernel as compared to 
Greedy. These overheads are respectively computed in terms of critical path length 



and time. At a smaller scale (Figure 3.13), it can be seen that Greedy can perform 



up to 13.6% better than PlasmaTree. 
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Best domain size for PlasmaTree (TT) =[ 1 3 5 5 5 10 10 10 10 10 20 ... 20 j 




(a) Theoretical CP length 



Best domain size for PlasmaTree (TT) = [1 5 5 5 17 28 8] Best domain size for PlABMaTree (TT) = [1 3 10 5 17 27 19] 




(b) Experimental (double complex) (c) Experimental (double) 

Figure 3.8: Overhead in terms of critical path length and time with respect to 
Greedy (Greedy = 1) 

For all matrix sizes considered, p = 40 and 1 < q < 40, in the theoretical model, 
the critical path length for Greedy is either the same as that of PlasmaTree 
(q = 1) or is up to 25% shorter than PlasmaTree (g = 6). Analogously, the critical 
path length for Greedy is at least 2% to 27% shorter than that of Fibonacci. In the 
experiments, the matrix sizes considered were p = 40 and q G {1, 2, 4, 5, 10, 20, 40}. In 
double precision, Greedy has a decrease of at most 1.5% than the best PlasmaTree 
(q = 1) and a gain of at most 12.8% than the best PLASMATREE (q = 5). In 
double complex precision, Greedy has a decrease of at most 1.5% than the best 
PlasmaTree (q = 1) and a gain of at most 13.6% than the best PlasmaTree 
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(q = 2). Similarly, in double precision, Greedy provides a gain of 2.6% to 28.1% 
over Fibonacci and in double complex precision, Greedy has a decrease of at most 
2.1% and a gain of at most 28.2% over Fibonacci. 

Although it is evidenced that PlasmaTree does not vary too far from Greedy 
or Fibonacci, one must keep in mind that there is a tuning parameter involved and 
we choose the best of these domain sizes for PlasmaTree to create the composite 
result, whereas with Greedy, there is no such parameter to consider. Of particular 
interest is the fact that Greedy always performs better than any other algorithm^] 
for p ^> q. In the scope of PlasmaTree, a domain size BS = 1 will force the use of 
a binary tree so that both Greedy and PlasmaTree behave the same. However, 
as the matrix tends more to a square, i.e., q tends toward p, we observe that the 
performance of all of the algorithms, including FlatTree, are on par with Greedy. 
As more columns are added, the parallelism of the algorithm is increased and the 
critical path becomes less of a limiting factor, so that the performance of the kernels 
is brought to the forefront. Therefore, all of the algorithms are performing similarly 
since they all share the same kernels. 




(a) Factorization kernels 




(b) Update kernels 



Figure 3.9: Kernel performance for double complex precision 



2 When q = I, Greedy and FlatTree exhibit close performance. They both perform a binary 
tree reduction, albeit with different row pairings. 
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TSMQR (in) 




ORMQR (in) 




TTMQR (in) 




ORMQR 1 TTMQR (in) 




GEMM (in) 




TSMQR (out) 




ORMQR (ont) 




TTMQR (out) 




ORMQR ■ TTMQR (out) 




GEMM (out) 



(a) Factorization kernels (b) Update kernels 

Figure 3.10: Kernel performance for double precision 



In order to accurately assess the impact of the kernel selection towards the per- 



formance of the algorithms, Figures 3.9 and 3.10 show both the in cache and out 
of cache performance using the No Flush and MultCallFlushLRU strategies as pre- 
sented in [291 [51]. Since an algorithm using TT kernels will need to call GEQRT 
as well as TTQRT to achieve the same as the TS kernel TSQRT, the comparison 
is made between GEQRT + TTQRT and TSQRT (and similarly for the updates). 
For rib = 200, the observed ratio for in cache kernel speed for TSQRT to GEQRT 
+ TTQRT is 1.3374, and for TSMQR to UNMQR + TTMQR is 1.3207. For out 
of cache, the ratio for TSQRT to GEQRT + TTQRT is 1.3193 and for TSMQR 
to UNMQR + TTMQR it is 1.3032. Thus, we can expect about a 30% difference 
between the selection of the kernels, since we will have instances of using in cache 
and out of cache throughout the run. Most of this difference is due to the higher 
efficiency and data locality within the TT kernels as compared to the TS kernels. 

Having seen that kernel performance can have a significant impact, we also com- 
pare the TT based algorithms to those using the TS kernels. The goal is to provide 
a complete assessment of all currently available algorithms, as shown in Figure |3.11 



For double precision, the observed difference in kernel speed is 4.976 GFLOP/sec for 
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Best domain size for PlasmaTree (TS) = [ 1 1 1 1 5 5 5 5 5 5 10 ... 10 ] 
Best domain size for PlasmaTree (TT) =[ 1 3 5 5 5 10 10 111 10 1(1 20 ... 20 ] 



Best domain size for PlasmaTree (TS) = [1 3 5 10 11 8 4] 
Best domain size for PlasmaTree (TT) = [1 5 5 5 17 28 8] 
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(a) Upper bound (double complex) 



Best domain size for PlasmaTree (TS) = [ 1 1 1 1 5 5 5 5 5 5 10 ... 10 ] 
Best domain size for PLASMATREE (TT) = [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ] 
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(c) Upper bound (double) 
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(b) Experimental (double complex) 



Best domain size for PLASMATREE (TS) = [1 3 6 11 12 18 32] 
Best domain size for PLASMATREE (TT) = [1 3 10 5 17 27 1?)] 
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(d) Experimental (double) 



Figure 3.11: Upper bound and experimental performance of QR factorization - All 
kernels 



the TS kernels versus 3.844 GFLOP/sec for the TT kernels which provides a ratio 
of 1.2945 and is in accordance with our previous analysis. It can be seen that as the 
number of columns increases, whereby the amount of parallelism increases, the effect 
of the kernel performance outweighs the benefit provided by the extra parallelism 
afforded through the TT algorithms. Comparatively, in double complex precision, 
Greedy does perform better, even against the algorithms using the TS kernels. As 
before, one must keep in mind that Greedy does not require the tuning parameter 
of the domain size to achieve this better performance. 
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(b) Experimental (double complex) (c) Experimental (double) 

Figure 3.12: Overhead in terms of critical path length and time with respect to 
Greedy (Greedy = 1) 

From these experiments, we showed that in double complex precision, Greedy 
demonstrated better performance than any of the other algorithms and moreover, 
it does so without the need to specify a domain size as opposed to the algorithms 
in PLASMA. In addition, in double precision, for matrices where p 3> q, Greedy 
continues to excel over any other algorithm using the TT kernels, and continues to 
do so as the matrices become more square. 

3.4 Conclusion 
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Best domain size for PlasmaTree (TS) = [1111555555 10 ... 10 ] 
Best domain size for PlasmaTree (TT) = [ 1 3 5 5 5 10 10 10 10 10 2(1 .... 20 




1 2 3 4 5 6789 10 20 30 40 

1 



(a) Theoretical CP length 



Best domain size for PLASMATREE (TS) = [1 3 5 10 11 8 4] Best domain size for PLASMATREE (TS) = [l 3 6 11 12 IS 32] 

Best domain size for PLASMATREE (TT) = [155 5 17 28 8] Best domain size for PLASMATREE (TT) = [1 3 11) 5 17 27 19] 




(b) Experimental (double complex) (c) Experimental (double) 



Figure 3.13: Overhead in terms of critical path length and time with respect to 
Greedy (Greedy = 1) 

In this chapter, we have presented Fibonacci, and Greedy, two new algorithms 
for tiled QR factorization. These algorithms exhibit more parallelism than state- 
of-the-art implementations based on reduction trees. We have provided accurate 
estimations for the length of their critical path. 

Comprehensive experiments on multicore platforms confirm the superiority of 

the new algorithms for p x q matrices, as soon as, say, p > 2q. This holds true 

when comparing not only with previous algorithms using TT (Triangle on top of 

triangle) kernels, but also with all known algorithms based on TS (Triangle on top 

of square) kernels. Given that TS kernels offer more locality, and benefit from better 
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elementary arithmetic performance, than TT kernels, the better performance of the 
new algorithms is even more striking, and further demonstrates that a large degree 
of a parallelism was not exploited in previously published solutions. 

Future work will investigate several promising directions. First, using rectangular 
tiles instead of square tiles could lead to efficient algorithms, with more locality and 
still the same potential for parallelism. Second, refining the model to account for 
communications, and extending it to fully distributed architectures, would lay the 
ground work for the design of MPI implementations of the new algorithms, unleashing 
their high level of performance on larger platforms. Finally, the design of robust 
algorithms, capable of achieving efficient performance despite variations in processor 
speeds, or even resource failures, is a challenging but crucial task to fully benefit from 
future platforms with a huge number of cores. 
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Algorithm 3.4: Greedy algorithm via TT kernels. 



1 for j = 1 to q do 

/* nz(j) is the number of tiles which have been eliminated in 
column j */ 

nZ(j) = 

/* nT(j) is the number of tiles which have been triangularized 
in column j */ 

nT(j) = 

4 while column q is not finished do 
for j — q down to 1 do 
if j == 1 then 

/* Triangularize the first column if not yet done */ 

nT ncw = nT(j) + (p - nT(j)) 
if p — nT(j) > then 
for k — p down to 1 do 
GEQRT(k,j) 
for jj = j + 1 to q do 
L UNMQR(k,j,jj) 



else 

/* Triangularize every tile having a zero in the 

previous column */ 

nT new = nZ(j-l) 
for k = nT(j) to nT ncw — 1 do 
GEQRT{p - k,j) 
for jj = j + 1 to 5 do 
L UNMQR(p-k,j,jj) 

/* Eliminate every tile triangularized in the previous step 
*/ 

= nZ{j) + L nT(j) - nZ(j) j 



nZ, 



1 do 



for kk = nZ(j) to nZ nc 

piv(p — kk) = p — kk — nZ ncw + nZ(j) 
TTQRT(p — kk,piv(p — kk),j) 
for jj = j + 1 to 5 do 
L TTMQR(p - kk,piv{p - kk),j,jj) 

/* Update the number of triangularized and eliminated tiles 
at the next step */ 

nT(j) = nT Ilcw 
nZ(j) = nZ DCW 



62 



Table 3.3: Time-steps for tiled algorithms. 





(a) 


Sameh-Kuck 




(b) Fibonacci 


(c) Greedy 


(d) BinaryTree 


(e) PlasmaTree (BS 
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★ 








* 
12 
10 
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* 
6 
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34 
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56 
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14 


46 


62 


78 
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16 


52 


68 


84 
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10 


40 


62 


86 


112 


136 


8 


34 


56 


78 
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128 


8 


28 


78 


102 


138 


158 


6 


54 


74 


90 
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122 


18 


58 


74 


90 
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122 


8 


36 


62 


84 


108 


134 


8 


30 


52 


78 


100 


122 


6 


42 


62 


112 


136 


172 


8 


28 


82 


102 


118 
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20 


64 


80 


96 


112 


128 


8 


34 


58 


84 
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6 


28 


50 


72 
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12 


40 


76 


96 
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34 


50 
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146 


22 


70 


86 


102 
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8 


34 


56 
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28 


50 


72 


94 


116 


6 


46 


74 


110 


130 
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56 


72 
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158 


24 


76 


92 
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124 
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56 


78 
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6 


28 


50 


68 


94 
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8 


28 


80 


108 


144 


164 


16 


52 


68 


84 


100 


166 


26 


82 


98 


114 


130 
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6 


28 


56 


78 
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6 


28 


44 


66 


88 


110 


6 


36 


56 


114 


142 


178 


6 


56 


80 


96 


112 


128 


28 


88 


104 


120 


136 


152 


6 


28 


50 


78 


100 


122 


6 


22 


44 


66 


88 


110 


10 


34 


64 


84 


148 


176 


8 


28 


84 


108 


124 


140 


30 


94 


110 


126 


142 


158 


6 


28 


44 


72 


100 


122 


6 


22 


44 


60 


82 


104 


6 


38 


02 


92 


112 


182 


10 


34 


50 


112 


136 
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32 


100 


116 
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6 


22 


44 


60 


94 


116 


6 


22 


38 


60 


76 


98 


8 


28 


66 


90 


114 
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12 


40 


56 


72 


140 


164 



Table 3.4: Neither Greedy nor ASAP are optimal. 



(a) Greedy nor Asap are optimal. 



(b) Greedy generally outperforms Asap. 



(a) 


Greedy 


(b) Asap 


* 
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* 
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10 
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10 


40 
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10 
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36 
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56 
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8 
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28 
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6 
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22 44 
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22 
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6 


22 40 


6 


22 


38 


6 


22 38 





Q 


P 


Algorithm 


16 


32 


64 


128 


16 


Greedy 


310 








Asap 


310 








32 


Greedy 


360 


650 






Asap 


402 


656 






64 


Greedy 


374 


726 


1342 




Asap 


588 


844 


1354 




128 


Greedy 


396 


748 


1452 


2732 


Asap 


966 


1222 


1748 


2756 
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Table 3.5: Three schemes applied to a column whose update kernel weight is not an 
integer multiple of the elimination kernel weight. 



a u 


(1) 


(2) 


(3) 


6 
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13 
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8 
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10 
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6 


7 


7 
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6 


5 


5 


5 


6 


5 


5 


5 


6 


5 


5 


5 
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Table 3.6: Greedy versus PT_TT and Fibonacci (Theoretical) 



p 


q 


GREEDY 


PT_TT 


BS 


Overhead 


Gain 


Fib 


Overhead 


Gain 


40 


1 


16 


16 


1 


1 


.0000 


0.0000 


22 


1 


.3750 


0.2727 


40 


2 


54 


60 


3 


1 


.1111 


0.1000 


72 


1 


.3333 


0.2500 


40 


3 


74 


98 


5 


1 


.3243 


0.2449 


94 


1 


.2703 


0.2128 


40 


4 


104 


132 


5 


1 


.2692 


0.2121 


116 


1 


.1154 


0.1034 


40 


5 


126 


166 


5 


1 


.3175 


0.2410 


138 


1 


.0952 


0.0870 


40 


6 


148 


198 


10 


1 


.3378 


0.2525 


160 


1 


.0811 


0.0750 


40 


7 


170 


226 


10 


1 


.3294 


0.2478 


182 


1 


.0706 


0.0659 


40 


8 


192 


254 


10 


1 


.3229 


0.2441 


204 


1 


.0625 


0.0588 


40 


9 


214 


282 


10 


1 


.3178 


0.2411 


226 


1 


.0561 


0.0531 


40 


10 


236 


310 


10 


1 


.3136 


0.2387 


248 


1 


.0508 


0.0484 


40 


11 


258 


336 


20 


1 


.3023 


0.2321 


270 


1 


.0465 


0.0444 


40 


12 


280 


358 


20 


1 


.2786 


0.2179 


292 


1 


.0429 


0.0411 


40 


13 


302 


380 


20 


1 


.2583 


0.2053 


314 


1 


.0397 


0.0382 


40 


14 


324 


402 


20 


1 


.2407 


0.1940 


336 


1 


.0370 


0.0357 


40 


15 


346 


424 


20 


1 


.2254 


0.1840 


358 


1 


.0347 


0.0335 


40 


16 


368 


446 


20 


1 


.2120 


0.1749 


380 


1 


.0326 


0.0316 


40 


17 


390 


468 


20 


1 


.2000 


0.1667 


402 


1 


.0308 


0.0299 


40 


18 


412 


490 


20 


1 


.1893 


0.1592 


424 


1 


.0291 


0.0283 


40 


19 


432 


512 


20 


1 


.1852 


0.1562 


446 


1 


.0324 


0.0314 


40 


20 


454 


534 


20 


1 


.1762 


0.1498 


468 


1 


.0308 


0.0299 


40 


21 


476 


554 


20 


1 


.1639 


0.1408 


490 


1 


.0294 


0.0286 


40 


22 


498 


570 


20 


1 


.1446 


0.1263 


512 


1 


.0281 


0.0273 


40 


23 


520 


586 


20 


1 


.1269 


0.1126 


534 


1 


.0269 


0.0262 


40 


24 


542 


602 


20 


1 


.1107 


0.0997 


556 


1 


.0258 


0.0252 


40 


25 


564 


618 


20 


1 


.0957 


0.0874 


578 


1 


.0248 


0.0242 


40 


26 


586 


634 


20 


1 


.0819 


0.0757 


600 


1 


.0239 


0.0233 


40 


27 


608 


650 


20 


1 


.0691 


0.0646 


622 


1 


.0230 


0.0225 


40 


28 


630 


666 


20 


1 


.0571 


0.0541 


644 


1 


.0222 


0.0217 


40 


29 


652 


682 


20 


1 


.0460 


0.0440 


666 


1 


.0215 


0.0210 


40 


30 


668 


698 


20 


1 


.0449 


0.0430 


688 


1 


.0299 


0.0291 


40 


31 


684 


714 


20 


1 


.0439 


0.0420 


710 


1 


.0380 


0.0366 


40 


32 


700 


730 


20 


i 
i 




0.0411 


732 


i 
i 




0.0437 


40 


33 


716 


746 


20 


1 


.0419 


0.0402 


754 


1 


.0531 


0.0504 


40 


34 


732 


762 


20 


1 


.0410 


0.0394 


776 


1 


.0601 


0.0567 


40 


35 


748 


778 


20 


1 


.0401 


0.0386 


798 


1 


.0668 


0.0627 


40 


36 


764 


794 


20 


1 


.0393 


0.0378 


820 


1 


.0733 


0.0683 


40 


37 


780 


810 


20 


1 


.0385 


0.0370 


842 


1 


.0795 


0.0736 


40 


38 


796 


826 


20 


1 


.0377 


0.0363 


862 


1 


.0829 


0.0766 


40 


39 


812 


842 


20 


1 


.0369 


0.0356 


878 


1 


.0813 


0.0752 


40 


40 


826 


856 


20 


1 


.0363 


0.0350 


892 


1 


.0799 


0.0740 
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Table 3.7: Greedy versus PT_TT (Experimental Double) 



p 


q 


UKrvhviJ Y (dj 


rill (d) 


Bb 


Overhead 


Lram 


40 


1 


36.9360 


37.5020 


1 


1.0153 


-0.0153 


40 


2 


58.5090 


52.7180 


3 


0.9010 


0.0990 


40 


4 


103.2670 


90.7940 


10 


0.8792 


0.1208 


40 


5 


115.3060 


100.5540 


5 


0.8721 


0.1279 


40 


10 


153.5180 


145.8200 


17 


0.9499 


0.0501 


40 


20 


170.8730 


171.8270 


27 


1.0056 


-0.0056 


40 


40 


184.5220 


182.8160 


19 


0.9908 


0.0092 



Table 3.8: Greedy versus PT_TT (Experimental Double Complex) 



p 


q 


GREEDY (z) 


PT_TT(z) 


BS 


Overhead 


Gain 


40 


l 


42.0710 


42.7120 


1 


1.0152 


-0.0152 


40 


2 


60.4420 


52.1970 


5 


0.8636 


0.1364 


40 


4 


95.1820 


84.1120 


5 


0.8837 


0.1163 


40 


5 


107.6370 


96.7530 


5 


0.8989 


0.1011 


40 


10 


135.0270 


128.4320 


17 


0.9512 


0.0488 


40 


20 


144.4010 


146.4220 


28 


1.0140 


-0.0140 


40 


40 


152.9280 


151.9090 


8 


0.9933 


0.0067 



Table 3.9: Greedy versus Fibonacci (Experimental Double) 



p 


q 


GREEDY(d) 


FIB(d) 


Overhead 


Gain 


40 


1 


36.9360 


26.5610 


0.7191 


0.2809 


40 


2 


58.5090 


49.4870 


0.8458 


0.1542 


40 


4 


103.2670 


100.1440 


0.9698 


0.0302 


40 


5 


115.3060 


115.0020 


0.9974 


0.0026 


40 


10 


153.5180 


152.0090 


0.9902 


0.0098 


40 


20 


170.8730 


170.4780 


0.9977 


0.0023 


40 


40 


184.5220 


180.2990 


0.9771 


0.0229 



Table 3.10: Greedy versus Fibonacci (Experimental Double Complex) 



p 


q 


GREEDY(z) 


FIB(z) 


Overhead 


Gain 


40 


1 


42.0710 


30.2280 


0.7185 


0.2815 


40 


2 


60.4420 


48.9570 


0.8100 


0.1900 


40 


4 


95.1820 


97.1650 


1.0208 


-0.0208 


40 


5 


107.6370 


105.9610 


0.9844 


0.0156 


40 


10 


135.0270 


134.5500 


0.9965 


0.0035 


40 


20 


144.4010 


145.5530 


1.0080 


-0.0080 


40 


40 


152.9280 


150.0980 


0.9815 


0.0185 



4. Scheduling of Cholesky Factorization 

In Chapter [2] we studied the Cholesky Inversion algorithm which consists of the 
three steps: Cholesky factorization, inversion of the factor, and the multiplication of 
two triangular matrices. In this chapter, we will focus on the Cholesky factorization 
but unlike the previous work where the number of processors was unbounded, we will 
consider the factorization in the context of a finite number of processors. By limiting 
the number of processors, the scheduling of the tasks becomes an issue. Moreover, the 
weight (processing time) of each task must be taken into consideration when creating 
the schedule. 

As before, we will be considering the critical path length for the algorithm but 
not as a function of the number of tiles rather as a function of the weights of the 
tasks. The weights are based upon the total computational cost for each kernel and 



are provided in Table |4.1| A more in-depth analysis of the length of the critical path 
with weighted tasks for the Cholesky Inversion algorithm can be found in [TB] which 
also provides 9t — 10 as the weighted critical path length for the Cholesky factorization 
of a matrix of t x t tiles. 

Table 4.1: Task Weights 



Weights 
# flops (in flops) 

POTRF \nl + 0{n 2 b ) 1 

TRSM n\ 3 

SYRK nl + 0{n 2 b ) 3 

GEMM 2nf + 0{n 2 b ) 6 



The upper bound on performance of perfect speedup and critical path introduced 
in Chapter [2] remains too optimistic and does not take into account any information 
which can be garnered from the DAG of the algorithm. This work makes progress 
towards providing a more representative bound on the performance of the Cholesky 
factorization in the tiled setting. 
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We also provide gains toward a bound on the minimum number of processors 
required to obtain the shortest possible weighted critical path (minimum make span) 
for the Cholesky factorization for a matrix of t x t tiles. 

4.1 ALAP Derived Performance Bound 

To obtain our bounds, we calculate the latest possible start time for each task 
(ALAP) and consider an unbounded number of processors without any costs for 
communication. If we did account for communication, we might see the critical path 
length increase which would in turn decrease our upper bound. We start at the final 
tasks and consider how many processors are needed to execute these tasks without 
increasing the length of the critical path. We step backwards in time until such a 
point that there are more processors needed to keep the critical path length constant. 
Thus we must add enough processors to execute the tasks and in turn create more 
idle time for the execution of tasks which are successors. At a certain point, there is 
no more need to add processors and this is then the number of processors needed to 
obtain the constant length critical path. 

By forcing as late as possible (ALAP) start times, any schedule will keep as 
many or fewer processors active as the ALAP execution on an unbounded number of 
processors. Thus by evaluating the Lost Area (LA), or idle time, for a given number 
of processors, p, at the end of the ALAP execution on an unbounded number of 
processors, we can increase the sequential time by the amount of LA and divide this 
result by the p to obtain the best possible execution time, i.e., 

T p = Tseq ± LAp (4.1) 

and we define this to be the ALAP Derived Performance Bound. Hence the maximum 
speedup that we can expect is given by 



T E. 

J- seq rp , j a ■ 

- 1 seq \ ■ LJjr *- , p 
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An example will help to further illustrate this technique. In Figure 4.1 we are 
given the ALAP execution of a 5 x 5 tiled matrix which has T seq = 125. The ordered 
pairs indicated provide the number of processors and idle time, respectively, and 



in Table 4.2 are given the values for T p , speedup, and efficiency. For more than 
four processors, there are enough processors to obtain the critical path length which 
becomes our limiting factor. 



cn 7 



« 



(5,45) 



■ POTRF 

■ TRSM 
SYRK 

■ GEMM 



(4,24) 




~l ' 1 ' 2 ' 3 ' 4 ' 5 ' 6 ' 7 ' S ' 9 ' lo' ll' 12' 13' 14' 15' lfi' VP is' IE)' 20' 2l' 22' 23' 24' 25' 26' 27' 2s' 29' 30' 3l' 32' 33' 34 

time 

Figure 4.1: ALAP execution for 5 x 5 tiles 
Table 4.2: Upper bound on speedup and efficiency for 5 x 5 tiles 



V 


T 


Sp 


E p 


1 


125.00 


1.00 


1.00 


2 


64.50 


1.94 


0.97 


3 


45.33 


2.76 


0.92 


4 


37.25 


3.36 


0.84 


5 


35.00 


3.57 


0.71 


6 


35.00 


3.57 


0.60 


7 


35.00 


3.57 


0.51 


8 


35.00 


3.57 


0.45 


9 


35.00 


3.57 


0.40 


10 


35.00 


3.57 


0.36 



4.2 Critical Path Scheduling 

In order to provide a critical path schedule, we use the Backflow algorithm to 

assign priorities to tasks of a DAG such that each task's priority adheres to its de- 
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pendencies. The algorithm is described in four steps: 

STEP 1 : Beginning at the final task in the DAG, set its priority 

to its processing time. 

STEP 2 : Moving in the reverse direction, set each incidental 
task's priority to the sum of its the processing time and 
the final task's priority. 

STEP 3 : For each task in STEP 2, moving in the reverse direc- 
tion, set each incidental task's priority to the sum of 
its processing time and the maximum priority of any 
incidental successor task. 

STEP 4 : Repeat the procedure until all tasks have been assigned 
a priority. 



An example is given in Figure The processing times are given in parenthesis 
and the assigned priorities (cp) are designated in square brackets. Tasks A and B 
will be assigned a priority of 16 since cp(A) = 3 + max (cp(C), cp(£))) and cp(B) = 
3 + max (cp(.D), cp(E)). 

By following the path with the highest priorities, a critical path can be discerned 
from the weighted DAG. Thus any schedule which then chooses from the available 
tasks those with the highest priorities to execute first inherently follows the critical 
path. It is well known that a critical path scheduling is not always optimal. As an 
example |43j . take two processors and four tasks. Let tasks A, B, C, and D have 
weights of 3, 3, 1, and 1, respectively and let the only relationship between tasks be 
that C is a predecessor of D. Then cp(A) = cp(£>) = 3, cp(C) = 2, and cp(D) = 1. 
A critical path schedule would choose to schedule tasks A and B simultaneously and 
follow up with C and then D, resulting in a schedule of length five. However, if A 
and C are scheduled simultaneously and then D follows A on the same processor and 
B follows C on the other processor, the length of the schedule is four. 
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END 


(1) 




(a) State at start (b) State at finish 

Figure 4.2: Example derivation of task priorities via the Backflow algorithm 

We will use this critical path information to analyze three schedules by choosing 
available tasks via max(cp), rand(cp), or min(cp). The max(cp) will naturally follow 
the critical path by scheduling tasks with the highest cp first and, vice versa, the 
min(cp) will schedule from the available tasks those with the minimum cp. Between 
these two, we also choose randomly amongst the available tasks with rand(cp). 

4.3 Scheduling with synchronizations 

The right-looking version of the LAPACK Cholesky factorization as depicted 



in Figure |l.lc| provides an alternative schedule which can be easier to analyze and 
understand. We will apply the three steps of the algorithm to a matrix of t x t 
tiles. In the tiled setting, we can provide synchronization points between the varying 
tasks of each step and simply schedule any of the available tasks since there are no 
dependencies between the tasks in each grouping. By adding these synchronizations, 
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this schedule is not able to obtain the critical path no matter how many processors 
are available. Algorithm 4.1 is the right-looking variant of the Cholesky factorization 
with added synchronization points. 



Algorithm 4.1: Schedule for tiled right-looking Cholesky factorization with 
added synchronizations to allow for grouping. 

1 Tile Cholesky Factorization (compute L such that A = LL T ); 



2 for j 


= to t - 1 do 


3 


schedule POTRF(i) ; 


4 


synchronize; 


5 


for j ' = i + 1 to t — 1 do 


6 


[ schedule TRSM(j,i) ; 


7 


synchronize; 


8 


for j ' = i + 1 to t — 1 do 


9 




for k = i + ltoj — 1 


10 




schedule GEMM(j. 


11 


synchronize; 


12 


for j ' = i + 1 to t — 1 do 


13 


[ schedule SYRK(j,i) ; 


14 


synchronize; 



Naturally, we can improve upon the above schedule by removing the synchro- 



nization between some of the groupings (Algorithm 4.2). The update of the trailing 
matrix is composed of two groupings, namely the GEMMs and the SYRKs, which 
can be executed in parallel if enough processors are available. Moreover, the added 
synchronization point between the update of the trailing matrix and the factorization 
of the next diagonal tile may also be removed. This schedule does become more com- 
plex, but given enough processors, the schedule is able to obtain the critical path as 
the limiting factor to performance. The minimum number of processors, p, needed to 
obtain the critical path is p = — l) 2 ] for a matrix of t x t tiles since the highest 
degree of parallelism is realized for the update of the first trailing matrix which is of 
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size (t-1) x (t- 1). 

Both of these schedules differ from the critical path scheduling due to the added 
synchronization points and will show lower theoretical performance. In the theoretical 



results, we only show Algorithm 4.2 



Algorithm 4.2: Improvement upon Algorithm 4.1 



1 Tile Cholesky Factorization (compute L such that A = LL T ); 

2 for j ' = to t — 1 do 
schedule POTRF(i) ; 
synchronize; 

for j ' = i + 1 to t — 1 do 

[ schedule TRSM(j,i) ; 

synchronize; 

for j ' = i + 1 to t — 1 do 
for k — i + 1 to j — 1 do 

schedule GEMM(j,i,k) ; 

schedule SYRK(j,i) ; 



5 
6 

7 

8 
9 
10 

11 



4.4 Theoretical Results 

In the following figures, our Rooftop bound will be that which uses the perfect 
speedup until the weighted critical path is the limiting factor, i.e., 

Rooftop Bound = max ^Critical Path, T -f) (4.2) 

We will compare this to our ALAP Derived bound which was derived using the ALAP 
execution, our various scheduling strategies, and the following lower bound. From [201 
p. 221, §7.4. 2], given our DAG, we know that the make span, MS, of any list schedule, 
a, for a given number of processors, p, is 

MS(a,p)< U-^jMS^p) 

where MS op t is the make span of the optimal list schedule without communication 

costs. However, we do not know MS opt and must therefore substitute the make span 
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of the Critical Path Scheduling using the maximum strategy to compute our lower 
bound. 

The ALAP Derived bound improves upon the Rooftop bound precisely in the area 
that is of most concern, namely where there is enough parallelism but not enough 
processors to fully exploit that parallelism. 



Speedup with p=40 



Efficiency with p=40 




- Rooftop bound 

- ALAP Derived bound 
Critcal Path 

- Critcal Path random 

- Critcal Path min 
-Algorithm 4.2 

(2- 1/p)*MS 




- Rooftop bound 

- ALAP Derived bound 
Critcal Path 

- Critcal Path random 

- Critcal Path min 



(a) Speedup 



(b) Efficiency 



Scalability with p=40 



Critical Path 

- Critical Path random 

- Critical Path min 
■■ Algorithm 4.2 




(c) Comparison to new bound 
Figure 4.3: Theoretical results for matrix of 40 x 40 tiles. 



Figure |4.3| (a) shows that the Critical Path schedule is actually quite descent and 
comes to within two percent of the ALAP Derived bound. Moreover, the ALAP 
Derived bound has significantly reduced the gap between the Rooftop bound and any 
of our list schedules. 
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4.5 Toward an a 



opt 



It is interesting to know how many processors one needs to be able to schedule 
all of the tasks and maintain the weighted critical path. We will view this problem 
in terms of tiles and state the problem as follows: 

Given a matrix oftxt tiles, determine the minimum number of processors, 
p opt , needed to schedule all tasks and achieve an execution time equal to 
the weighted critical path. 



Toward that end, we will let p = at 2 where < a < 1. Our analysis will be 



asymptotic in which we let t tend to infinity. From the analysis of Algorithm 4.2 



we already know that a op t < \. Using MATLAB, we have calculated the a value for 
matrices of t x t tiles as shown in Figure 



4.4 



It is our conjecture that a opt 



Values ofccfor3<t<40 



- ALAP Derived 

Critical Path ma 
» % difference 



Figure 4.4: Values of a for matrices of t x t tiles where 3 < t < 40 



4.6 Related Work 

For the LU decomposition with partial pivoting, much work has been accom- 
plished to discern asymptotically optimal algorithms for all values of a [37J ESJ H3] . 
They consider a problem of size n and assume p = an processors on which to schedule 
the LU decomposition. The critical path length of the optimal schedule in this case 
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Efficiency vesus the ratio a = p / n 



Efficiency vesus the ratio ct = p 1 1 2 








Rooftop bound 

ALAP Derived bound 

Critcal Path max 


^— ^— upper bound 
optimal algorithm 







(a) LU Decomposition 



(b) Tiled Choelsky factorization 



Figure 4.5: Asymptotic efficiency versus a = p/n for LU decomposition and versus 
a = p/t 2 for Tiled Cholesky factorization. 



is n 2 and a opt ~ 0.347 where a opt is a solution to the equation 3a — a 3 = 1. 



In Figure 4.5, we make a comparison between the algorithmic efficiency of the LU 
decomposition and the tiled Cholesky factorization. In the case of the LU decompo- 
sition, the attainable upper bound on efficiency closely resembles our previous bound 
of perfect speedup which is then limited by the critical path. On the other hand, the 
tiled Cholesky factorization does not exhibit this type of efficiency which can be seen 
from the gap between our Rooftop bound and the ALAP Derived bound. Unlike the 
work in |37J, we do not provide an algorithm which attains the ALAP Derived bound. 
4.7 Conclusion and future work 

In many research papers, the performance of an algorithm is usually compared to 
either the performance of the GEMM kernel or against perfect scalability resulting in 
large discrepancies between the peak performance of the algorithm and these upper 
bounds. If an algorithm displays a DAG such as that of the tiled Cholesky factor- 
ization, it is unrealistic to expect perfect scalability or even make comparisons to the 
performance of the GEMM kernel. Thus one needs to consider a new bound which 
is more representative of the algorithm and accounts for the structure of the DAG. 

Without such a bound it is difficult to access whether there are any performance 
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gains to be achieved. Although we do not have a closed-form expression for this new 
bound, we have shown that such a bound exists. Moreover, we have also shown that 
any algorithm which schedules the tiled Cholesky factorization while maintaining the 
weighted critical path will require 0(t 2 ) processors for a matrix of t x t tiles and the 
coefficient is somewhere around 0.2. 

In this chapter, we have applied a combination of existing techniques to a tiled 
Cholesky factorization algorithm to discover a more realistic bound on the perfor- 
mance and efficiency. We did so by considering an ALAP execution on an unbounded 
number of processors and used this information to calculate the idle time for any 
list schedule on a bounded number of processors. This is then used to calculate the 
maximum speedup and efficiency that may be observed. 

Further work is necessary to provide a closed-form expression of the new bound 
dependent upon the number of processors. In addition, we need to include communi- 
cation costs in the bound to make it more reflective of the actual scheduling problem 
on parallel machines. As can be seen in Figure 4.3[ c), the Critical Path Schedule is 
within 2% of our ALAP Derived bound. Although scheduling a DAG on a bounded 
number of processors is an NP-complete problem, it may be not be the case for the 
DAG of the tiled Cholesky factorization. Pursuant investigation might show that the 
Critical Path Scheduling is the optimal schedule. 
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5. Scheduling of QR Factorization 

In this chapter, we present collaborative work with Jeffrey Larson. We revisit 
the tiled QR factorization as presented in Chapter [3] but do so in the context of a 
bounded number of resources. (Chapter [3] was concerned with finding the optimal 
elimination tree on an unlimited number of processors.) We will be using the same 
analytical tools as in Chapter [4] to derive good schedules and to improve upon the 
Rooftop bound. The Cholesky factorization has just one DAG, therefore Chapter [4] 
is a standard scheduling problem, i.e., how to schedule a DAG on a finite number 
of processor. Unlike the previous chapter, we will need to consider all of the various 
algorithms (i.e., elimination trees) and cannot distill the analysis down to a single 
DAG. 

5.1 Performance Bounds 

Each of the algorithms studied in Chapter [3j namely FlatTree, Fibonacci, 
Greedy, and GrASAP, produces a unique DAG for a matrix of p x q tiles. In turn, 



the ALAP Derived bounds (|4.1|) for each elimination tree will also be unique. In Fig- 
ure 



5.1 we give the computed upper bounds and make comparisons to the scheduling 



strategies of maximum, random, and minimum via the Critical Path Method for a 
matrix of 20 x 6 tiles. The matrix size was chosen such that the critical path length 



of Greedy is 136 and the critical path length of GrASAP is 134 (see Figure 3.5 



in 



3.2.2). 



The GrASAP algorithm for a tiled matrix is optimal in the scope of unbounded 
resources. However by the manner in which the ALAP Derived bound is computed, 
the bound created by using GrASAP cannot hold for all of the other algorithms. 
Consider the ALAP execution of the Fibonacci and GrASAP algorithms on a 



matrix of 15 x 4 tiles. In Figure |5.2[ we show the execution of the last tasks for 
GrASAP on the left and Fibonacci on the right. More of the tasks in the ALAP 
execution for Fibonacci are pushed towards the end of the execution which means 
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Speedup with p=20, q=6 
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Rooftop bound . 

— ALAP Derived bound 

Crilcal Path max 
^— ^ Critcal Path random 
Crilcal Path min 



24 32 40 48 



(a) FlatTree 



Speedup with p=20. q=6 




Rooftop bound 
ALAP Derived bound 
Critcal Path 
Critcal Path random 
Critcal Path min 



(c) Greedy 




Roottop bound 
ALAP Derived bound 
Critcal Path 
Critcal Path random 
Critcal Path min 



(b) Fibonacci 




Roottop bound 
ALAP Derived bound 
Critcal Path 
Critcal Path random 
Critcal Path min 



(d) GrASAP 



Figure 5.1: Scheduling comparisons for each of the algorithms versus the ALAP 
Derived bounds on a matrix of 20 x 6 tiles. 



the ALAP Derived bound will be higher than that of GrASAP for a schedule that 
uses fewer than 10 processors. In other words, as we add more processors, the Lost 
Area (LA) increases much faster for GrASAP than it does for Fibonacci. Since 
the critical path length for Fibonacci is greater than that of GrASAP, after a 
certain number of processors, the ALAP Derived bound for Fibonacci falls below 



that of GrASAP. These observations are evident in Figure 5.3 where we show a 



comparison of the bound for each algorithm. Recall that the Rooftop bound (4.2 ) only 



takes into account the critical path length of an algorithm such that for GrASAP 
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TTMQR 





GrASAP Fibonacci 

Figure 5.2: Tail-end execution using ALAP on unbounded resources for GrASAP 
and Fibonacci on a matrix of 15 x 4 tiles. 



it can be considered a bound for all the algorithms since GrASAP is optimal for 
unlimited resources and thus has the shortest critical path length. 
5.2 Optimality 

There is no reason for the tree found optimal in Chapter [3] on an unbounded num- 
ber of resources to be optimal on a bounded number resources. We cast the problem 
of finding the optimal tree with the optimal schedule as an integer programming 
problem. (A complete description of the formulation can be found in Appendix |Aj) 
Similarly, in a Mixed-Integer Linear Programming approach was used to provide 
an upper bound on performance. However, the integer programming problem size 
grows exponentially as the matrix size increases, thus the largest feasible problem 



size was a matrix of 5 x 5 tiles. In Figure 5.4 we show the speedup of the GrASAP 



algorithm with its bound and make comparisons to an optimal tree with an optimal 



schedule and Table 5.1 provides the actual schedule lengths for all of the algorithms 
using the CP Method for the matrix of 5 x 5 tiles.. 
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Speedup with p=1 5, q=4 



Efficiency with p=15. q=4 




Figure 
tiles. 



(a) Speedup (b) Efficiency 

5.3: ALAP Derived bound comparison for all algorithms for a matrix of 15 x 4 



5.3 Elimination Tree Scheduling 

We pair up the choice of the elimination tree with a type of scheduling strategy 
to obtain the following bounds: 



GrASAP 



< 



y Rooftop bound 
Moreover, we also have 



GrASAP 



< 



Rooftop bound 



optimal tree 
optimal schedule 



GrASAP 



< 



GrASAP 



< 



1 GrASAP ^ 



optimal schedule 



CP schedule 



< 



ALAP Derived Bound 



GrASAP 



optimal schedule 



< 



GrASAP 
CP schedule 



Combining these inequalities with Table 5.1 gives rise to the following questions: 



Given an optimal elimination tree for the tiled QR factorization on an 
unbounded number of resources 

(Ql) does there always exist a scheduling strategy such that the schedule 
on limited resources is optimal? 

(Q2) does the ALAP Derived bound for this elimination tree hold true 
for any scheduling strategy on any other elimination tree? 
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Optimality Comparison with p=5, q=5 




Integer Program 

ALAP Derived bound 

Q.I 1 1 1 1 =-J 

14 8 12 

# of threads 

Figure 5.4: Comparison of speedup for CP Method on GrASAP, ALAP Derived 
bound from GrASAP, and optimal schedules for a matrix of 5 x 5 tiles on 1 to 14 
processors 



Table 5.1: Schedule lengths for matrix of 5 x 5 tiles 



ALAP Derived Optimal CP Method 

Procs Bound(GRASAP) Tree/Schedule GrASAP Greedy Fibonacci FlatTree 



1 


500 


500 


500 


500 


500 


500 


2 


255 


256 


256 


256 


256 


256 


3 


176 


176 


178 


178 


178 


176 


4 


138 


138 


140 


140 


140 


140 


5 


116 


116 


118 


118 


118 


116 


6 


102 


104 


104 


104 


104 


104 


7 


92 


94 


94 


94 


94 


94 


8 
9 


86 
82 


88 
84 


88 
84 


88 
84 


88 
86 


88 
86 


10 


80 


80 


82 


82 


86 


86 


11 


SO 


80 


80 


SO 


86 


86 


12 


80 


80 


80 


SO 


86 


86 


13 


80 


80 


80 


so 


86 


86 


14 


80 


80 


80 


80 


80 


86 



82 



From Chapter [3] we have that GrAS AP is an optimal elimination tree for the tiled 
QR factorization. We know that the length of an optimal schedule for GrASAP 
on p processors will necessarily be greater or equal to the ALAP Derived bound for 
GrASAP on p processors by way of construction of the ALAP Derived bound. Thus 
(Ql) implies (Q2). We cannot address the first question directly since the size of 
the matrix needed to produce a counter example is too large for verification with the 
integer programming formulation. 

Therefore we need to find a matrix size for which a schedule exists whose execution 
length is smaller than the ALAP Derived bound from GrASAP on the same matrix. 
As we have seen in Figure 5.3[ the Fibonacci elimination tree on a tall and skinny 



tiled matrix provides the best hope. 

Consider a matrix of 34 x 4 tiles on 10 processors. The ALAP Derived bound 
from GrASAP is 188. Using the Critical Path method to schedule the Fibonacci 
elimination tree we obtain a schedule length of 184. Therefore the ALAP Derived 
bound from GrASAP does not hold for this schedule. By implication, we have that 
(Ql) is false. However, the Rooftop bound from GrASAP is still a valid bound for 
all of the schedules. 

5.4 Conclusion 

In this chapter we have applied the same tools used in Chapter [4] to provide 
performance bounds for the tiled QR factorization. Further, we have shown that the 
ALAP Derived bound is algorithm dependent. This leaves that only bound we have 
for all algorithms is the Rooftop bound as computed using the GrASAP algorithm. 

The analysis in this chapter has also shown that an optimal algorithm for an 
unbounded number of resources does not imply that a scheduling strategy exists such 
that it can be scheduled optimally. 
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6. Strassen Matrix-Matrix Multiplication 

Matrix multiplication is the underlying operation in many if not most of the 
applications in numerical linear algebra and as such, it has garnered much attention. 
Algorithms such as the Cholesky factorization, LU decomposition and more recently 
the QR-based Dynamically Weighted Halley iteration for polar decomposition [38J, 
spend a majority of their computational cost in matrix-matrix multiplication. The 
conventional BLAS Level 3 subprogram for matrix-matrix multiplication is of 0(n a ), 
where a = log28 = 3, computational cost but there exist subcubic computational 
cost algorithms. In 1969, Volker Strassen [18] presented an algorithm that computes 
the product of two square matrices of size n x n, where n is even, using only 7 
matrix multiplications at the cost of needing 18 matrix additions/subtractions which 
then can be called recursively for each of the 7 multiplications. This compares to 
the standard cubic algorithm which requires 8 matrix multiplications and only 4 
matrix additions. When Strassen's algorithm is applied recursively down to a constant 
size, the computational cost is 0(n a ) where a = log 2 7 ~ 2.807. Two years later, 
Shmuel Winograd proved that a minimal of 7 matrix multiplications and 15 matrix 
additions/subtractions, which is less than the 18 of Strassen's, are required for the 
product of two 2x2 matrices, see [31]. These discoveries have spawned a flurry of 
research over the years. In 1978, Pan |39j showed that a < 2.796. In the late 1970's 
and early 1980's, Bini [12] provided a < 2.78 with Schonage [16] following up by 
showing a < 2.522 but was usurped the following year by Romani [H] who discerned 
a < 2.517. In 1986, Strassen brought forth a new approach which lead to a < 2.497. 
In 1990, Coppersmith and Winograd [23] improved upon Strassen's result providing 
the asymptotic exponent a < 2.376. This final result still stands, but it is conjectured 
that a = 2 + e for any e > where e can me made as small as possible. Although 
the Copper smith- Winograd algorithm may be reasonable to implement, since the 
constant of the algorithm is huge and will not provide an advantage except for very 
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large matrices, we will not consider it and instead focus on the Strassen-Winograd 
Algorithm. 

6.1 Strassen-Winograd Algorithm 

Here we discuss the algorithm as it would be implemented to compute the product 
of two matrices A and B where the result is stored into matrix C. The algorithm is 
recursive, thus we describe one step. Given the input matrices A, B, and C, divide 
them into four submatrices, 



A 



An A 12 

A-21 A 2 2 



B 



B21 B 2 2 



C 



Cll C\2 

C21 C22 



then the 7 matrix multiplications and 15 matrix additions/subtractions are computed 



as depicted in Table pA\ and Figure pA\ shows the task graph of the Strassen-Winograd 
algorithm for one level of recursion. 

Table 6.1: Strassen-Winograd Algorithm 



Phase 1 


T x 


= A 2 i + A22 


T 5 = 


B\2 — Bn 




T 2 


= T X - A u 


n = 


B22 — T 5 




T 3 


= A u - A 2 i 


T 7 = 


B22 — B12 




T 4 


= A l2 - T 2 


T 8 = 


Tq — B21 


Phase 2 


Qi 


= T 2 x T 6 


Q 5 = 


TixT 5 




Q2 


= An x Bn 


Q 6 = 


T 4 x B 2 2 




Qs 


= A 12 x B 2 i 


Qr = 


A22 x T 8 




Qa 


= T 3 xT 7 






Phase 3 




= Qi + Q2 


C n = 


= Q2 + Q3 




u 2 


= U 1 + Q i 


C\2 = 


= U! + U 3 






— Q5 + Q& 


C2I = 


= U 2 -Q 7 








C22 ~ 


= U 2 + Q 5 



In essence, Strassen's approach is very similar to the observation that Gauss made 
concerning the multiplication of two complex numbers. The product of (a + hi)(c + 

di) = ac — bd+ (bc + ad)i would naively take four multiplications, but can actually be 
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Figure 6.1: Task graph for the Strassen-Winograd Algorithm. Execution time pro- 
gresses from left to right. Large ovals depict multiplication and small ovals addi- 
tion / subtraction. 



accomplished via three multiplications by discerning that bc + ad = (a + b) (c + d) 
be — ad. 



6.2 Tiled Strassen-Winograd Algorithm 

In our tiled version, the matrices are subdivided such that each submatrix is of 
the form 

Mijn Miji2 ■ ■ ■ Miji n 



M 



1:1 



M im M ij22 ... M,, 



ij2n 



Mij n i Mij n2 . . . Mij nn 

where the matrices M^m are tiles of size x n&. As before one can proceed with 
full recursion, unlike before this would not terminate at the scalar level, but rather it 
would terminate with the multiplication of two tiles using a sequential BLAS Level 
3 matrix-matrix multiplication. The recursion can also be cutoff at a higher level 



at which point the tiled matrix multiplication of Algorithm 6^ computes the result- 
ing multiplication. For the addition/subtraction of the submatrices in Phase 1 and 
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Phase 3 of the Strassen-Winograd algorithm, a similar approach is used which is also 
executed in parallel. 



Algorithm 6.1: Tiled Matrix Multiplication (tiled_gemm) 



/* Input: nxn tiled matrices A and B, Output: 
matrix C such that C = A x B 
1 for i = 1 to n do 



nxn tiled 



*/ 



for j = 1 to n do 
for k = 1 to n do 

Cij i An* x Bkj + Cij 



If the cutoff for the recursion occurs before the tile level, the computation for each 
can be executed in parallel. Therefore our tiled implementation of the Strassen- 
Winograd algorithm exploits two levels of parallelism. Moreover, this allows some 



parts of the matrix multiplications to occur early on as can be seen in Figure 6.2 



which shows the DAG for a matrix of 4 x 4 tiles with one level of recursion. Both 
Figure 6^2 and Figure |6.1 illustrate one level of recursion but the tiled task graph of 
a 4 x 4 tiled matrix clearly portrays the high degree of parallelism. 

The conventional matrix-matrix multiplication algorithm requires 8 multiplica- 
tions and 4 additions whereas the Strassen-Winograd algorithm requires 7 multipli- 
cations and 15 additions/subtractions for each level of recursion. Therefore, there 
are more tasks for the Strassen-Winograd algorithm as compared to the conven- 
tional matrix-matrix multiplication and it would behoove us to reduce the number 
of tasks which would also reduce the algorithmic complexity. On the other hand, 
since we are reducing the number of multiplications, the computational cost is also 
reduced since this requires a cubic operation versus the quadratic operation of the 
addition/subtractions. 

The total number of tasks, T, of the Strassen-Winograd algorithm is given by 



Figure 6.2: Strassen-Winograd DAG for matrix of 4 x 4 tiles with one recursion. 
Execution time progresses from left to right. Large ovals depict multiplication and 
small ovals addition/subtraction. 
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Table 6.2: Recursion levels which minimize the number of tasks for a tiled matrix of 
size p x p 



7 1 


T 

1 mm 


GflonfSWl 

VJ 11UU 1 U VV 1 


GflonfGEMMl 


4 


i 


8.96e-01 


1.02e+00 


8 


i 


7.15e+00 


8.18e+00 


ID 


i 


o. 1 ze+Ul 


o.ooe+Ul 


32 


i 


4.57e+02 


5.24e+02 


64 


2 


3.20e+03 


4.19e+03 


128 


3 


2.24e+04 


3.35e+04 


256 


4 


1.57e+05 


2.68e+05 


512 


5 


1.09e+06 


2.14e+06 


1024 


6 


7.69e+06 


1.71e+07 



which is minimized at recursion level r TO ,„ when 



51n(|) 



In 2 



V 



and the total number of flops, F, is given by 

F = m r(P-Y + i5aYr- 1 (JL 

\2 r J ^ \1 r - % 



mp 



bap 2 



2 r / ^ \2 r ~ i 

i=0 

where m = 2n^ — for the multiplications and a = for the additions/subtractions. 

As we increase the recursion, the number of tasks will decrease up to a certain 
point, r min . The reason for this being at each recursion we reduce the number of p 3 
tasks by | while increasing the p 2 tasks by 15. 

In our experiments, letting the tile size nj, = 200 provided the best performance. 



Thus Table 6.2 shows the corresponding values for the minimizing recursion level for 



various number of tiles. However, as can be seen in Table [673] the r m j n which provides 
the minimum number of tasks does not provide the least amount of computational 
cost. The computational costs will be minimized at the full recursion. 

Even though the number of tasks and thereby the computational complexity are 



minimized for r m i n , Table [6T4] shows that the critical path length increases exponen- 
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Table 6.3: 128 x 128 tiles of size n b = 200 



algorithm 


recursion 


tasks 


(ornop 




1 


1,896,448 


2.92e+04 




2 


1,774,592 


2.56e+04 




3 


1,762,048 


2.24e+04 


strassen.winograd 


4 


1,915,712 


1.96e+04 




5 


2,338,288 


1.72e+04 




6 


3,212,252 


1.51e+04 




7 


4,859,338 


1.33e+04 


tiled_DGEMM 




4,177,920 


3.36e+04 



Table 6.4: Comparison of the total number of tasks and critical path length for matrix 
of p x p tiles. 



p 


r 


# tasks 


CP 


Ratio 







64 


2 


32.0 


4 












1 


116 


7 


16.6 







512 


3 


170.7 


8 


1 


688 


9 


76.4 




2 


1,052 


52 


20.2 







4,096 


4 


1,023.5 




1 


4,544 


13 


349.5 


16 










2 


5,776 


66 


87.5 




3 


8,324 


361 


23.1 







32,768 


5 


6,553.6 




1 


32,512 


21 


1,548.2 


32 


2 


35,648 


94 


379.2 




3 


44,272 


459 


96.5 




4 


62,108 


2,524 


24.6 







262,144 


6 


43,690.7 




1 


244,736 


37 


6,614.5 


64 


2 


242,944 


150 


1,619.6 




3 


264,896 


655 


404.4 




4 


325,264 


3,210 


101.3 
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tially with the recursion levels. Thus the amount of parallelism is likewise reduced 
for each recursion level. 

6.3 Related Work 

In practice the Strassen-Winograd algorithm imparts a large amount of overhead 
for small matrices, thus it is customary to overcome this issue by using it in conjunc- 
tion with a conventional matrix multiplication operation. The key idea is to provide 
a recursion cutoff point such that once this is reached, the algorithm switches from 
calling itself recursively to calling, e.g., the BLAS Level 3 matrix multiplication. The 
recursion cutoff point is a tuning parameter which depends upon the machine archi- 
tecture and can be either dynamic or static. In [49j, one level of recursion is used 
while Chou [2T] provides two levels of recursion by explicitly coding the 49 matrix 
multiplications which then is processed by either 7 or 49 processors. Our approach is 
to provide the recursion cutoff point as a parameter which can be set by the user. 

There are various methods which can be used to deal with non square matrices 
and/or matrices of odd order. Methods such as static padding, dynamic padding, 
and dynamic peeling all provide these mechanisms. In this chapter, we only study 
matrices of tile order p = 2 k , i.e., n = 2 k ■ n^. 

In [T7j, Boyer et al. propose schedules for both in-place and out-of-place imple- 
mentations without the need for extra computations. Discovery of these algorithms 
was accomplished by using an exhaustive search algorithm. Their out-of-place algo- 
rithm makes use of the resultant matrix for temporary storage for the intermediate 
computations. This introduces unwanted data dependencies and more data move- 
ment leading to a loss of parallelism. Hence, we do not consider their out-of-place 
algorithm and focus on the classical Strassen-Winograd algorithm by using temporary 
storage allocations for all of the intermediate computations. As an example, given 
a tiled matrix of 128 x 128 tiles of size 200 x 200, the input matrices and resultant 

matrix require 1.96608 GB of storage and allowing full recursion for our algorithm 
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Algorithm 6.2: Tiled Strassen-Winograd (tiled_gesw) 

/* p is equal to the recursion cutoff, do the multiplication */ 

1 if p = r then 

2 i tiled_gemm( p, A,B,C ) 

3 else 

/* p is greater than the recursion cutoff, so we split the 





problem in half 


4 


p = p/2 


5 


/* Phase 1 


6 


tiled_geadd( p, A 21 , A 22) 7\ ) 


7 


tiled_geadd( p, T u A n ,T 2 ) 


8 


tiled_geadd( p, A u , A 21 , T 3 ) 


9 


tiled_geadd( p,A 12 ,T 2 ,T 4 ) 


10 


tiled_geadd( p, B 12 , B u , T 5 ) 


11 


tiled_geadd( p,B 22 ,T b ,T 6 ) 


12 


tiled_geadd( p, B 22 , B 12 , T 7 ) 


13 


tiled_geadd( p,T 6 ,B 21 ,T 8 ) 


14 






/* Phase 2 


15 


tiled_gesw( p,T 2 ,T 6 ,Q 1 ) 


16 


tiled_gesw( p, A n ,B u , Q 2 ) 


17 


tiled_gesw( p, A 12 , B 21 , Q 3 ) 


18 


tiled_gesw( p, T 3 , T 7 , Q 4 ) 


19 


tiled_gesw( p,T ± ,T 5 ,Q 5 ) 


20 


tiled_gesw( p, T 4 , B 22 , Q 6 ) 


21 


tiled_gesw( p, A 22 ,T 8 , Q 7 ) 


22 






/* Phase 3 


23 


tiled_geadd( p, Q 1 , Q 2 , U 1 ) 


24 


tiled_geadd( p, U\,Q^ U\ ) 


25 


tiled_geadd( p, Q 5 , Q 6 , Ui ) 


26 


tiled_geadd( p, Q 2 , Q 3 , C n ) 


27 


tiled_geadd( p, U 3 , C 12 ) 


28 


tiled_geadd( p, U 2 , Q 7 , C 2i ) 


29 


tiled_geadd( p, U 2 ,Q 5 , C 22 ) 



*/ 



*/ 
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would require an additional 3.2766 GB of temporary storage (see Figure 6.3) 



Memory Allocation for temporary storage 
n = 25600, p = 128 



Figure 6.3: Required extra memory allocation for temporary storage for varying re- 
cursion levels. 



In [27] a sequential implementation of Strassen-Winograd is studied by Douglas 
et al. which also provides means for a hybrid parallel implementation where the lower 
level is the sequential Strassen-Winograd and the upper level algorithm is the standard 
subdivided matrix multiplication. Comparisons are made to sequential implementa- 
tions available in the IBM's ESSL and Cray's Scientific and Math library. Although 
this would have been an interesting project in and of itself within a tiled framework, 
the emphasis in this chapter was to provide the Strassen-Winograd at the upper level. 

After this work was completed, Ballard et al. published work which places 
Strassen-Winograd in a parallel distributed environment [9]. The paper is well writ- 
ten and they do show improvement over the standard algorithm for matrices of order 
over 94,000. Their algorithm is communication optimal and applies better to the 
distributed environment than the shared memory environment seeing that we do not 
have as much control over the memory distribution. 

6.4 Experimental results 

All experiments were performed on a 48-core machine composed of eight hexa- 

core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz. 
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Each core has a theoretical peak of 11.2 Gflop/s with a peak of 537.6 Gflop/s for 
the whole machine. The Istanbul micro-architecture is a NUMA architecture where 
each socket has 6 MB of level-3 cache and each processor has a 512 KB level-2 cache 
and a 128 KB level-1 cache. After having benchmarked the AMD ACML and Intel 
MKL BLAS libraries, we selected MKL (10.2) since it appeared to be slightly faster 
in our experimental context. Using MKL, for DGEMM each core has a peak of 9.7 
Gflop/s with a peak of 465.6 Gflops/s for the whole machine. Linux 2.6.32 and Intel 
Compilers 11.1 were also used in conjunction with PLASMA 2.3.1. 

The parameter for tile size has a direct effect on the amount of data movement 



and the efficiency of the kernels. Figure 6.4a presents the performance comparison 
for varying tile sizes indicating = 200 as the most efficient. As expected, it is 
also evidenced that the efficiency of the Strassen-Winograd algorithm is dependent 
upon the efficiency of the tiled DGEMM. When running on 48 threads, increasing the 



recursion level decreases the performance of the algorithm as depicted in Figure 6.4 
(r = is the tiled matrix- matrix multiplication). 



Comparison ot tile size on 12 threads, n = 6400 




200 400 600 800 



■ i tiled_dgemm 
1200 1400 



(a) Tile size comparison 




(b) Recursion level comparison 



Figure 6.4: Comparison of tuning parameters and r. 



Our implementation allows for the tuning of the recursion level and can range 



from one recursion up to full recursion. In Figure |6.4b[ matrices of 64 x 64 tiles are 
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used and the recursion level ranges from one to five. Although r min = 2 for 64 x 64 
tiles, the best performance using 48 threads is seen at r = 1. This is due to the 
amount of parallelism lost by having the critical path length increase as the recursion 
level increases which offsets any gains of the reduction in tasks, and computational 
cost and complexity. 



Scalability comparison 
n = 128D0,p = 64 



Elficiency comparison 
n = 12800, p = 64 



- strassen_winograd, r = 1 

- tiled_dgemm 
-GEMM 




(a) Scalability comparison 



(b) Efficiency comparison 



Figure 6.5: Scalability and Efficiency comparisons on 48 threads with matrices of 
64 x 64 tiles and n b = 200. 



Figures 6.5a and 6.5b illustrate the performance and efficiency reached by the 
Strassen-Winograd implementation, with r = 1, as compared to the multithreaded 
MKL DGEMM and the tiled DGEMM implementation. The Strassen-Winograd im- 
plementation outperforms the tiled DGEMM up to the point where we loose par- 
allelism. However, both show sub-par performance when compared to the multi- 
threaded MKL implementation. 

If we run on 12 cores (typical current architecture for a node) then we do outper- 



form tiled DGEMM (Algorithm 6.1 ) if a loss of parallelism is not a factor and there is 



not too much data movement, i.e., keep small enough so that more tiles fit into the 
cache but large enough to retain the efficiency of the GEMM kernel. Depicted in Fig- 
ure 



6.6, the performance of the Strassen-Winograd algorithm is best when r = r r 
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(a) Scalabilty comparison 



(b) Efficiency comparison 



Figure 6.6: Scalabilty and efficiency comparisons executed on 12 threads with ma- 
trices of 64 x 64 tiles and nj = 200. 



since the number of tasks and computational complexity is minimized which reflects 



the analysis of § 6.2 



6.5 Conclusion 

In this chapter we have shown and analyzed an implementation of the Strassen- 
Winograd algorithm in a tiled framework and provided comparisons to the multi- 
threaded MKL standard library as well as a tiled matrix multiplication. The interest 
in this implementation is that it can support any level of recursion and any level of 
parallelism through the use of the recursion and tile size parameters. 

Albeit that our implementation did not perform as well as the highly tuned 
and optimized multithreaded MKL library, on 12 cores with 2 levels of recursion, 
its performance was only lower by about 2%. Ultimately, it is a formidable task 
to surpass the MKL implementation considering the computational complexity of a 
recursive tiled algorithm. 



96 



7. Conclusion 

In this thesis, we have studied the tiled algorithms both theoretically and exper- 
imentally. Our aim was to alleviate the constraints of memory boundedness, task 
granularity, and synchronicity imposed by the LAPACK library as brought to light 
in the Cholesky Decomposition example in the introduction. Moreover, we have also 
detailed that one may translate the LAPACK algorithms directly to tiled algorithms 
while in other cases a new approach may provide better performance gains. 

In the study of the Cholesky Inversion, the tiled algorithms provided a unique 
insight into the interaction of the three distinct steps involved in the algorithm and 
how these may be intertwined. We had observed that the choice of the variant in 
the inversion of the triangular factor (Step 2) has a great impact on the performance 
of the algorithm. In the scope of unlimited number of processors, this choice can 
lead to an algorithm which performs the inversion of the matrix in almost the same 
amount of time it takes to do the Cholesky factorization. Moreover, we note that 
the combination of the variants with the shortest critical path length does not trans- 
late into the best performing pipelined algorithm. Even though variant 3 of Step 2 
(the triangular inversion) did not provide us with the shortest critical path length 
within itself, combined with the other two steps, it does provide a Cholesky Inversion 
algorithm that performed the best overall. 

We have also observed that a simple translation from the LAPACK routines may 
not provide a tiled algorithm with the best performance. We showed that loop- 
reversals are needed to alleviate anti-dependencies which negatively affect the perfor- 
mance of the tiled algorithms. 

In the case where a tiled algorithm already existed, namely the QR Factorization, 
we made use of a new tiled algorithm to improve upon performance. These algorithms 
exhibit more parallelism than state-of-the-art implementations based on elimination 
trees. Using ideas from the 1970/80's, we have theoretically shown that the new 
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algorithm, GrASAP, is optimal in the scope of unbounded number of processors. 
We have provided accurate estimations for the length of the critical paths for all 
of these new tiled algorithms and have provided explicit formulas for some of the 
algorithms. 

In the framework of bounded number of processors, our theoretical work has af- 
forded a new bound which more accurately reflects the performance expectations of 
the tiled algorithms. In the case of the Cholesky Factorization, the schedule produced 
using the Critical Path Method proved to be within seven percent of the ALAP De- 
rived bound indicating that this scheduling strategy is well suited for this application. 
The ALAP Derived bound has also been used as a tool to show that the optimality 
in the scope unbounded number of processors does not translate to optimality in the 
scheduling on a bounded number of processors. 

Overall, the theoretical and experimental portions of this thesis give credence to 
the impact of tiled algorithms for multi/many-core architectures. 
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APPENDIX A. Integer Programming Formulation of Tiled QR 
A.l IP Formulation 

We formulate the problem in question using integer programming. An equivalent 

binary programming formulation was also constructed (with time as an additional 
index), but we found the integer formulations were more quickly solved. 

Let i, j, and h denote rows (ranging 1, . . . ,p); k, I, l\ denote columns (ranging 
1, . . . , q); r, s, and t denote time steps (ranging 1, . . . , T). The upper bound on the 
number of time steps, (T), may come from any existing algorithm (greedy, ASAP, 
GRASP, etc.). Let the decision variables be defined as follows: 

A. 1.1 Variables 

Let all t be an integer ranging from zero to T denoting the time we complete the 
following tasks (and t — means the task is never performed). 

• UNMQR: Complete applying the reflectors from GEQRT across the row. (Re- 
quires 6 units of time.) 

Wiki — t G [0, T] : if we finish the update of tile (i, k) at time t. (A.l) 
(This update was necessitated bjxu = s : I < k, s < t). 

• GEQRT: Factor a square tile into a triangle. (Requires 4 units of time.) 

Xik = t G [0, T] : if we complete triangularization of tile (i, k) at 

(A.2) 

the end of time unit t. 

• TTMQR: Update the entries in two rows after TTQRT. (Requires 6 units of 
time.) 

Uijki — t G [0, T] : if we finish the update of tile (i, k) and (j, k) at 

(A.3) 

the end of time unit t. 
(This update was necessitated by Ziji — s : I < k,s < t) 
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• TTQRT: Cancel one triangular tile using another triangle tile. (Requires 2 
units of time.) 

Zijk — t G [0, T] : if we complete zeroing tile (i, k) using tile (J, k) 

(A.4) 

at the end of time unit t. 

For the y and z actions, it is useful to have a binary variable which is 1 if the action 
occurs. Explicitly, 

{1 : if yijki > 1 : if z ijk > 

z^k = < (A.5) 
: otherwise : otherwise 

A. 1.2 Constraints 

I. Time constraints for each of the four actions: 

(a) Time for Wiki 

i. Wiki must occur at least 3 time steps after earlier w updates. 

Wiki > Wikh + 3 V k > 2, % > I, k < I < k (A. 6) 

ii. Wiki must occur at least 3 time steps after earlier y updates. 

Wiki > Vijkh + Vjikh +3 V k>2,i>l,j,l 1 <l <k (A.7) 

iii. Wiki must occur at least 3 time steps before y updates in the current 
column. 

Wiki + 3 < y ijk i + yjiki + (1 - Vijki ~ Vjiki)T V i, j, I < k, k > 2 (A.8) 

iv. Wiki must occur at least 2 time steps before x. 

Wiki + 2 <x ik V i > k > 2, / < k (A. 9) 

v. Wiki must occur at least 1 time step before z actions. 

Wiki + 1 < z ijk + Zjik + (1 - z^k - Zj ik )t V i,j,k>2,l < k (A. 10) 
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(b) Time for 

i. Xik must occur 2 time steps after any w updates. (Identical to equation 



(A.9). 



ii. Xik must occur 2 time steps after any y updates. 

x ik > ym + Vjiki +2 Vj,l,i>k,l < k (A. 11) 

iii. Xik must occur 1 time steps before any z action. 

x ik + 1 < z ijk + zjik + (1 - z ijk - ZjikjT V i> k,j (A. 12) 

(c) Time for y ijH 

i. yijki must occur 3 time steps after w updates originating from the same 



column. (Identical to equation A. 8 



ii. y^ki must occur 2 time steps before any Xik or Xjk- (Identical to 
equation 



A.ll 



iii. yijki must be 3 time steps before or after any y action involving rows 
i or j. 

We must enforce 

Uijkl + Vjikl + 3 < yhikl + Uihkl + (1 — Vhikl — Vihkl)T 

or 

Vhikl + Uihkl + 3 < yijki + Ujikl + (1 — Vijkl — Vjikl)T 



And 



Vijkl + Vjikl + 3 < Vhjkl + Vjhkl + (1 — Vhjkl ~ Vjhkl)T 



or 



Vhjkl + Vjhkl + 3 < Vijkl + Vjikl + (1 — Vijkl — Vjikl)T 
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We define the binary variables 8\ ijW 8 2 hijW 8\ ijW and 5fc jW and include 
the disjunctive constraints 

Vijki + Vjiki + 3 < yhiki + Uihki + (1 - Vuki - Vihki)T + 5\ ijkl T (A.13) 

Vh,i,j,l <k>2 

Vhiki + Vihki + 3 < yijki + Vjiki + (1 - Vijki ~ VjikijT + SlijkiT ( A - 14 ) 

Vh,i,j,l <k>2 

<%ra + *2yH>l Vh,i,j,l<k>2 (A.15) 

and 

J/tjfti + 2/jifc« + 3 < 2/ hj - w + y jfM + (1 - - y jh ki)T + 8 3 hijkl T (A. 16) 

Vh,i,j,l <k>2 

yhjki + yjhki + 3 < y ijU + y jM + (1 - jfoju - %ifei) T + $hijki T ( A - 17 ) 
Vh,i,j,l <k>2 



+ St m >l Vh,i,j,l<k>2 



J hijkl ' u hijkl 



(A.18) 



iv. yijki must occur 3 time steps before any z action involving rows i or j. 

Vijki + 3 < z M k + Zihk + (1 - 4ifc - z ihk )T V h,i,j,k > l,k > 2 

IfijjM + 3 < + z jfefc + (1 - z h jk ~ Zjhk)T V h,i,j,k > l,k >2 

(A.19) 

(d) Time for z^ 

i. Zij k must occur 1 time step after any w action. (Identical to equation 



A. 10 



ii. Zijk must occur 1 time step after any y action. (Identical to equation 
|A~19l ) 



iii. z^k must occur 1 time step after any x action. (Identical to equation 



A.12 
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iv. Zij k must occur 1 time step before or after any other z action. 

A. Case 1. We use (i, k) to zero (J, k) and (i, k) to zero (h, k) 
We need to enforce 

Zjik + 1 < z hik + (1 — z hik )T 
or 

Zhik + 1 < Zj ik + (1 — Zjik)T 

To do this, we define binary variables 5\^ k and 5^ k and include 
the disjunctive constraints as follows. 

Zjik + 1 < + (1 - + S lijk T v ^> *> h k (A.20) 

^ifc + 1 < Zjik + (1 - %fc)T + <Jh - fc T V /i, ijj, k (A.21) 
^ fc + ^>l V/i,i,j,fc (A.22) 

B. Case 2. We use (i, k) to zero (j, k) and (/i, fc) to zero (i, k) We 
need to enforce 

Zjik + 1 < z ihk + (1 - ^ fe )T V h, k(i > jV.) (A.23) 

2. A tile can't zero itself, (and hence we can't update just a single row afterwards). 

zak = V i, k y iik i = V i, k, I (A.24) 

3. Both tiles involved in TTQRT must be triangles before one can zero another 

Xik < (1 - z ijk )T + z^k V i,j, k, x jk < (1 - %fc)T + V i, j, fc (A.25) 

4. Force updates after triangle and zeroing actions. 

(a) After a tile is triangularized, updates must occur in the next column. 

Xik < wuk — 3 V i,k < q,i > k,l > k (A. 26) 
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(b) After a tile is zeroed, updates must occur in the next column. 

Zijk < (Vijik + Vjiik) Vi,j,h<l (A.27) 

5. The updates of (i, k) (arising from pivot /) from triangularizing must occur 
before updates from zeroing involving (i, k) (also arising from pivot I). 

Wiki < (1 - Vijki - Vjiki)T + yijki + Djiki V i,j,k > I (A.28) 

6. No updates to a tile can occur after triangularization. 

Xik > w ik i V i>k,k> I x ik > y ijk i + y jM V i > k,j,k > I (A.29) 

7. After a tile (i, k) is zeroed, we can't use it to zero. 

z^k > z hik V h, i, j, k (A.30) 

8. Tiles on or below the diagonal must be triangularized at some point, (and we 
can't finish a triangularization until time step 2.) 

x ik > 2 Vi>k (A.31) 

9. Tiles strictly below the diagonal must be zeroed at some point. 

^% fe = l \/i>k (A.32) 
j 

10. Can't triangularize above the diagonal. 

x ik = V i<k (A.33) 

11. Force binary variables 

Vijki < yijki yijki *T > y ijkl k, I (A.34) 

Zijk ^ z^k z^k * T > Zijk ^hii k (A. 35) 
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A. 1.2.1 Precedence constraints 

The most cumbersome constraints to formulate are those forcing the correspond- 
ing TTQRT and TTMQR operations to be executed in the same order. 

For each tile (i, k) involved in a zeroing process with (J, k) and (h, k), the order 
of the updates must follow the order of the zeroing processes. We want < Zihk 
(for some h) or z jik < z hik (for some h), to force y^fc+i)* < Uih(k+i)k- 



1. a 1 



a hijk — \ 



1 : if we [use (i, k) to zero (h, k)} and also [use (i, k) 

to zero (j, k) ] (AM) 
: otherwise 



2. a 2 



h hijk 



3. b 



a hijk — ^hik 



O-hijk — %ifc 

1 : if we [use (/i, k) to zero (i, fc)] after [using (i, k) to 

zero (j, fc)] 
: otherwise 



a hiik — ^ihk 



l hijk 
,2 



bhii 



hijk 



a hijk — %fc 

1 : if [z hik > zjik] regardless of whether [z hik = 0] 
: otherwise 



(A.37) 



(A.38) 



(A.39) 



(A.40) 
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^^hijk — Zfiik Zjik 
T(b h ijk — 1) < Z hik — Z jik 



(A.41) 



4. c 1 



^hijk 



: otherwise 



(A.42) 



,i 



Chijk — a hijk 
C hijk — bhijk 
z hijk + 1 — a Ljfe + ^ 



(A.43) 



5. c 



Chijk 



hijk 



if zeroing actions in column of rows h and « to 
occur after the zeroing actions of rows i and j 
otherwise 



(A.44) 



(A.45) 



Chijk > c\ ijk 

Chijk > a hijk 

Chijk < c\ ijk + a^ jfe 

So we now have a variable c^jk that is 1 when updates of row h and i must come 
before the updates of i and j. We can define similar variables for the updates 
rather than the zeros. 



6. d 



''hijk 







if we [update (i, k) and (h, k) together] and also 
[update (i,k) and (j,k) together]. (All updates 
occur because of zeroing in column k — 1) 
otherwise 



(A.46) 



Ahijk < Vhik(k-l) + Vihk(k-1) 
dhijk < Vjik{k~l) + Vijk(k-l) 

dhijk + 1 > Vhik(k-i) + Vihk(k-i) + Vjik(k-i) + Vijk(k-i) 



(A.47) 
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&hijk < 



1 : if updates in column k of rows h and i to occur 
after updates of i and j or the updates in i and j 
never happen 

: otherwise 



Tehijk > (yhik(k-l) + yihk(k-l)) ~ (yjik(k-l) + Vijk(k-l)) 

T(e hi j k — 1) < (yhik(k-i) + Uihk(k-i)) - (yjik(k-i) + Uijk(k-i)) 



8. / 



fi 



hijk 



1 : if the updates in column k of rows h and i to occur 

after updates of i and j 
: otherwise 



fhijk ^ dfrijk 
fhijk ^ &hijk 
fhijk — dfiijk ~\~ Ghijk 

Lastly, to force the updates in order: 

Chijk < fhiji Vh,i,j,k<l 

A. 1.3 Objective function 

Let total_time be a variable such that 

total lime > Wiki Vi,k,l 

total lime > Xik Vi, k 
total lime > y^ki Vi, j, k, I 

total lime > Zij k Vi, j, k 



Then our objective function is: 



min total lime 
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